Method and apparatus for sociological data mining

ABSTRACT

A processing system for retrieving interrelated documents is described. The system comprises a document repository for storing a plurality of documents, a metadata repository for storing a plurality of metadata elements to represent relations between the documents, and a sociological analysis engine to identify relationships between the documents using the metadata elements from the metadata repository.

FIELD OF THE INVENTION

The present invention relates to electronic documents, and more particularly to a method for retrieving a document or (more typically) a group of documents that satisfies a user-defined criterion or set of criteria. Additionally, the invention relates to the detection of patterns among these documents, and the various actors operating upon them.

BACKGROUND

The volume of electronic information in both personal and corporate data stores is increasing rapidly. Examples of such stores include electronic mail (e-mail) messages, word-processed and text documents, contact management tools, and calendars. But the precision and usability of knowledge management and search technology have not kept pace. The vast majority of searches performed today are still keyword searches or fielded searches. A keyword search involves entering a list of words, which are likely to be contained within the body of the document for which the user is searching. A fielded search involves locating documents using lexical strings that have been deliberately placed within the document (usually at the top) with the purpose of facilitating document retrieval. These data retrieval techniques suffer from two fundamental flaws. Firstly, they often result in either vast numbers of documents being returned, or, if too many keywords or attribute-value pairs are specified and the user specifies that they must all appear in the document, no documents at all. Secondly, these techniques are able only to retrieve documents that individually meet the search criteria. If two or more related (but distinct) documents meet the search criteria only when considered as a combined unit, these documents will not be retrieved. Examples of this would include the case where the earlier draft of a document contains one keyword and the later draft contains another keyword that is absent from the first document; or an e-mail message and an entry in an electronic calendar, where the calendar entry might clarify the context of a reference in the e-mail message.

SUMMARY OF THE INVENTION

A processing system for retrieving interrelated documents is described. The system comprises a document repository for storing a plurality of documents, a metadata repository for storing a plurality of metadata elements to represent relations between the documents, and a sociological analysis engine to identify relationships between the documents using the metadata elements from the metadata repository.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

FIG. 1 is a block diagram of one embodiment of a network, which may be used with the present invention.

FIG. 2 is a block diagram of one embodiment of a computer system.

FIG. 3 is a block diagram of navigation flow in one embodiment of the present invention.

FIG. 4 a is a flowchart of one embodiment of the initial preprocessing step.

FIG. 4 b is a flowchart of one embodiment of OCR processing of graphics files.

FIG. 4 c is a block diagram of one embodiment of the relationship between discussions, documents, and singletons.

FIG. 5 is a block diagram of one embodiment of document subclasses and relations to event types.

FIG. 6 a is a flowchart of one embodiment of the e-mail identity extraction process.

FIG. 6 b is a flowchart of one embodiment of the e-mail identity disambiguation process.

FIG. 7 a is an illustration of one embodiment of the relationship between colored graph elements.

FIG. 7 b is a fragment of one embodiment of a colored graph sample.

FIG. 8 is a flowchart of one embodiment of the initial actor graph construction process.

FIG. 9 a is a flowchart of one embodiment of an overview of the second stage of actor 310 graph construction.

FIG. 9 b is a flowchart of one embodiment of edge creation and weight assignment during actor graph construction.

FIG. 9 c is a flowchart of one embodiment of bottom-up clustering during actor graph construction.

FIG. 9 d is a flowchart of one embodiment of the process of deriving aliases during actor graph construction.

FIG. 10 a is a flowchart of one embodiment of the ownership/authorship model.

FIG. 10 b is a flowchart of one embodiment of the ownership/authorship model where the user logs on to multiple systems.

FIG. 11 is a flowchart of one embodiment of the process for detecting spam and other exogenous content.

FIG. 12 is a block diagram of one embodiment of the actor class and containment hierarchy.

FIG. 13 a is a flowchart of one embodiment of the alias versioning process.

FIG. 13 b is a flowchart of one embodiment of the actor parsing and deduplication process.

FIG. 14 is a flowchart of one embodiment of the actor personality identification process.

FIG. 15 is a block diagram of one embodiment of the circle of trust or clique class hierarchy and associations.

FIG. 16 is a flowchart of one embodiment of the process of detecting crisis/event-based circles of trust.

FIG. 17 a is a flowchart of one embodiment of the process of computing document similarity.

FIG. 17 b is a flowchart of one embodiment of document indexing granularity and deduplication.

FIG. 18 is a flowchart of one embodiment of the process of computing actor involvement.

FIG. 19 is a flowchart of one embodiment of the iterative process of building the actor graph.

FIG. 20 a is a flowchart of one embodiment of the process of textblock identification.

FIG. 20 b is a flowchart of one embodiment of the process of textblock attribution.

FIG. 20 c is a flowchart of one embodiment of the process of textblock attribution with OCRed documents.

FIG. 21 a is a block diagram of one embodiment of a partial event hierarchy.

FIG. 21 b is a block diagram of one embodiment of the element relationship in a sample ontology.

FIG. 22 a is an illustration of one embodiment of the clockdrift detection process.

FIG. 22 b illustrates the concept of “warped time”.

FIG. 23 a is a block diagram of one embodiment of the pragmatic tag spectrum.

FIGS. 23 b-23 e are a flowchart of one embodiment of the pragmatic tagging process.

FIG. 24 a is a block diagram of one embodiment of the relationship of discussions to other objects.

FIGS. 24 b, 24 c, 24 d and 25 a are a flowchart of one embodiment of phase one of discussion building.

FIG. 25 b is a flowchart of another embodiment of discussion building.

FIGS. 25 c, 25 d and 25 e are a flowchart of one embodiment of phase two of discussion building.

FIG. 25 f is a diagram of one embodiment of data structures used in the second phase of discussion building.

FIG. 26 a is an illustration of one embodiment of the evolution of an actor presence model for a small example discussion.

FIG. 26 b is a flowchart of one embodiment of ad hoc workflow identification.

FIG. 27 is a flowchart of one embodiment of the discussion summary construction process.

FIG. 28 a is a flowchart of one embodiment of the resolution item determination process.

FIG. 28 b is a flowchart of one embodiment of the resolution template selection process.

FIG. 29 is a flowchart of one embodiment of the discussion partitioning process.

FIG. 30 is a flowchart of one embodiment of the pivotal messages identification process.

FIG. 31 is a block diagram of one embodiment of the query engine.

FIG. 32 is a block diagram of one embodiment of the primary template and return types.

FIG. 33 is a flowchart of one embodiment of the relevance ranking process.

FIG. 34 is a block diagram of one embodiment of the query by example process.

FIG. 35 is a flowchart of one embodiment of the process for handling answer-formatted questions.

FIG. 36 a is a flowchart of one embodiment of the dynamic update detection process.

FIG. 36 b is a flowchart of one embodiment of the indexing process, including dynamic indexing.

DETAILED DESCRIPTION OF THE INVENTION

The present invention comprises a system for organizing documents and in some cases, portions of their content, into causally related sets of information. These sets of documents will be referred to as “discussions.” The humans or computer system processes that contribute content to a discussion will be referred to as “actors.” The system for organizing such documents permits the use of advanced search techniques.

There is a clear need for a search technique that returns sets of related documents that are not merely grouped by textual similarity, but also grouped and sequenced according to the social context in which they were created, modified, or quoted. By grouping documents in this way, the present invention is able to solve previously intractable problems. This makes it possible to retrieve a very precise set of documents from a large corpus of data. Hitherto, with conventional search tools, this has only been possible by the use of complex search queries, and the results have been restricted to documents that individually meet the search criteria. The present invention allows a precise set of documents to be retrieved from a large corpus of texts using simpler search queries, and with the added benefit of presenting the returned documents in the context of causally related documents (for example, meeting minutes sent out after a board meeting), even when those other documents do not, individually, satisfy the search criteria. This relieves the user of the need for detailed prior knowledge (before running the search) of keywords likely to occur in any sought-after documents, or of such details as the exact date on which a message was sent, or who sent it.

Consequently, less skilled query-writers can retrieve a set of related documents (or an individual document) more effectively and accurately with the present invention than with conventional search engines. Furthermore, the availability of class-based ontologies composed of sets of synonyms, antonyms, and other linguistic data will allow users to search from a pre-selected set of likely relevant topics; in this case, the system will generate queries over sets of terms automatically, sparing the end user from manual input of large sets of terms.

In addition, it is of great importance in many contexts to determine patterns of behavior in corpuses of data. Such patterns include sets of actors who regularly communicate with one another, the paths down which certain types of information often travel, as well as anomalies in such communications. Segmenting a corpus of documents into causally related chains facilitates the definition and capture of such complex patterns of behavior, as well as divergence from them.

An apparatus is disclosed for selectively grouping and retrieving sets of interrelated electronic documents and records. The apparatus uses a combination of different types of sociological and linguistic evidence from the documents in the corpus in order to establish highly probable causal relationships between specific documents or document parts. This results in the creation of a kind of “electronic paper trail” spanning many different software applications, as well as potentially many different media, networks or machines. We call the sets of documents or document parts that are joined together in this manner “discussions.” Discussions are container objects with a number of their own attributes, such as a name, lifespan, and set of actors associated in some way with their various content items. Organizing a corpus of electronic information into discussions not only allows data to be retrieved more accurately and completely, with a significant reduction in the amount of unwanted data retrieved, but also allows the detection of potentially interesting communication anomalies in the data that would otherwise be difficult or impossible to detect. One example is the gradual elimination over time of a particular person from discussions of certain kinds, or involving certain topics.

FIG. 1 depicts a typical networked environment in which the present invention operates. The network 105 allows access to e-mail data stores on an e-mail server 120, log files stored on a voicemail server 125, documents 505 stored on a data server 130, and data stored in databases 140 and 145. Data is processed by an indexing system 135 and sociological engine 150, and is presented to the user by a visualization mechanism system 140.

FIG. 2 depicts a typical digital computer 200 on which the present system will run. A data bus 205 allows communication between a central processing unit 210, random access volatile memory 215, a data storage device 220, and a network interface card 225. Input from the user is permitted through an alphanumeric input device 235 and cursor control system 240, and data is made visible to the user via a display 230. Communication between the computer and other networked devices is made possible via a communications device 245.

It will be appreciated by those of ordinary skill in the art that any configuration of the system may be used for various purposes according to the particular implementation. The control logic or software implementing the present invention can be stored in main memory 250, mass storage device 225, or other storage medium locally or remotely accessible to processor 210.

It will be apparent to those of ordinary skill in the art that the system, method, and process described herein can be implemented as software stored in main memory 250 or read only memory 220 and executed by processor 210. This control logic or software may also be resident on an article of manufacture comprising a computer readable medium having computer readable program code embodied therein and being readable by the mass storage device 225 and for causing the processor 210 to operate in accordance with the methods and teachings herein.

The present invention may also be embodied in a handheld or portable device containing a subset of the computer hardware components described above. For example, the handheld device may be configured to contain only the bus 215, the processor 210, and memory 250 and/or 225. The present invention may also be embodied in a special purpose appliance including a subset of the computer hardware components described above. For example, the appliance may include a processor 210, a data storage device 225, a bus 215, and memory 250, and only rudimentary communications mechanisms, such as a small touch-screen that permits the user to communicate in a basic manner with the device. In general, the more special-purpose the device is, the fewer of the elements need be present for the device to function. In some devices, communications with the user may be through a touch-based screen, or similar mechanism.

It will be appreciated by those of ordinary skill in the art that any configuration of the system may be used for various purposes according to the particular implementation. The control logic or software implementing the present invention can be stored on any machine-readable medium locally or remotely accessible to processor 210. A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g. a computer). For example, a machine readable medium includes read-only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, electrical, optical, acoustical or other forms of propagated signals (e.g. carrier waves, infrared signals, digital signals, etc.).

FIG. 3 shows the flow of navigation in the present invention. The user can submit queries 320, which return discussions 305. Each discussion 305 contains at least two actors 310. Each of the actors 310 about whom the user can submit queries 320 must appear in zero (0) or more discussions 305 (an actor 310 can appear in 0 discussions 305 by being connected in some way with a singleton document 435 which, by definition, is not part of a discussion 305). An actor 310 can be associated with multiple topics 315, and vice versa. Each discussion 305 can be associated with multiple topics 315, and vice versa.

Overview

FIG. 4 a is a flowchart of one embodiment of the initial preprocessing. The system requires a corpus of electronic data as input (block 405). In one embodiment, it will start off by doing a conventional spidering of the corpus. Alternative means of acquiring the data may be used. The data can be in any form, but in order to be processable by the system, must have at least some component that is textual in nature, and which further corresponds to a natural language such as English, German, Japanese, etc. Prior to the data being processed by the system described herein, the corpus is passed through an MD5 or similar hash filter (block 410) in order to remove common exogenous files such as those associated with the operating system. In many circumstances it is also desirable to remove additional files (block 415) that are binaries or ASCII files containing programmatic code, executables, source code, etc.
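The hash filter of block 410 can be illustrated with a minimal sketch. It assumes the corpus is already available as a mapping from file name to raw bytes and that a list of digests for known exogenous files exists; the names `KNOWN_EXOGENOUS_MD5` and `filter_corpus` are hypothetical and do not come from the specification.

```python
import hashlib
from typing import Dict, Set

# Hypothetical digest list of known exogenous files (e.g. operating system files).
# In practice such a list would be supplied externally; this entry is made up.
KNOWN_EXOGENOUS_MD5: Set[str] = {
    hashlib.md5(b"example operating system file").hexdigest(),
}


def md5_of_bytes(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a file's raw content."""
    return hashlib.md5(data).hexdigest()


def filter_corpus(files: Dict[str, bytes], known: Set[str]) -> Dict[str, bytes]:
    """Keep only the files whose digest is not in the known-exogenous set."""
    return {
        name: data
        for name, data in files.items()
        if md5_of_bytes(data) not in known
    }


corpus = {
    "kernel.sys": b"example operating system file",
    "memo.txt": b"Q3 budget memo",
}
kept = filter_corpus(corpus, KNOWN_EXOGENOUS_MD5)
```

The filter only catches byte-identical copies of listed files, which is why the text treats the removal of other binaries and source code (block 415) as a separate step.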

Subsequent to the indexing process, in one embodiment any document that is an image (for example, a TIFF) will be passed through an optical character recognition (OCR) process to ascertain whether it is a “real image.” FIG. 4 b is a flowchart of one embodiment of this distinguishing process. The next file is selected (block 450) for processing. If the process determines that the file is an image file (block 455) or a graphic, it returns to block 450, to select a next file. If the process determines that the file is an imaged document file, the system will perform OCR on the document (block 460), and resubmit the now recognizable document to the indexer (block 465). This process is iterated until all image files have been processed, at which point it ends.

FIG. 4 c is a block diagram of the relationship between discussions 305, documents 505, and singletons 435. A discussion 305 contains two or more related documents. A singleton 435 is a document 505 that is not part of a discussion 305.

Because the system heavily utilizes sociological evidence, it uses document types from an analysis standpoint. FIG. 5 illustrates one embodiment of the document type and sub-types that the present system uses. Note that some documents 505 or records may be of more than one type. With the exception of meta-documents 505, which are often not created contemporaneously with the rest of the corpus, the other kinds of documents 505 fall into two categories: those which are used by the system primarily to establish context, and those which are placed in the context that they provide.

Communication Documents 510: Any type of text-containing document 505 that also contains information about both its creator(s) and its recipients, coupled with a timestamp for each time it was sent. E-mail and IMs (instant messages) are examples of communication documents 510.

Communication Events 570: Any type of communication that contains information about at least one creator and at least one recipient, and has both a start time and an end time associated with it. It does not itself necessarily contain any other text, or have any other information associated with it. However, a communication event 570 may be associated with a document 505; for example a conference call might have meeting minutes recorded for it; a voicemail might have a speech to text indexer attached to it. Phone records and calendared meetings are examples of communication events 570. Communication events 570 are generally records in larger files, which require extraction, rather than being individual documents 505.

Regular Documents 515: Any type of document in the colloquial sense of the term; a file which contains at least some textual content. Examples include word processing documents, HTML pages, and financial spreadsheets. Regular documents 515 can be fielded documents 550, for example databases, or spreadsheets. They may be largely graphical in content, such as slide shows. Or they may be text documents 535, which just contain regular text or mark-up language. A regular document 515 requires at least one author, a creation date, and a modification history; its content may have numerous other properties which are discussed below.

Regular documents 515 are further divided into two sub-types according to how their change history is recorded:

-   Versioned Documents 555: Any document 505 whose changes are recorded in some formal and regular manner. Examples include any document 505 that is systematically under some type of version control, or that has consistently had change tracking enabled.
-   Lifecycle Documents 565: A lifecycle document 565 is one that is not under strict, or any, version control. Rather, its history of modification is reassembled to the extent possible by the system by a digital archaeology process described below.

In addition, regular documents 515 have a special event type that is associated with them:

-   Modification Event 540: An edit to a regular document 515. This may be one of the following:
    -   An edit that is part of a check-in message to a document repository system. Such an edit may have a comment or check-in message associated with it, and a list of actors who received notification of the change.
    -   An edit that is tracked by a change tracking system, this information being appended to the regular document 515 itself as meta-data.
    -   An edit for which no time stamp exists, but which is known to have occurred due to the fact that there are different versions of the regular document 515.

Structure Documents 530: A document 505 or series of records that provide information on organizational or process structure. Examples include, but are not limited to, Human Resource management and workflow management systems. Structure documents 530 are analyzed solely for purposes of analyzing the content-bearing documents 505. Structure documents 530 can include actor lifecycle events 545, such as name change, hiring, termination, promotion, transfer, vacation, leave of absence, etc. For example, an HR database might indicate that “Ann Jones” had previously been known as “Ann Ross” and hence that the two names in fact correspond to the same actor 310. Structure documents 530 are generally analyzed on a per-record basis. Task lists in personal organizers can also be considered structure documents 530.

Logging Documents 520: A document 505 that contains records involving any type of event, for example access logs or transaction logs. Logging documents 520 are generally evaluated on a per-record basis.

Meta-Data Documents 560: A document 505 containing information about other documents 505 or their contents, or to which these documents 505 and their contents may be pertinent. Examples include depositions of witnesses involving particular sets of documents 505. In general, meta-data documents 560 are external documents 525, since they do not belong in the corpus.

External Documents 525: A document 505 that has been created external to the corpus of data, and integrated into the corpus either by a human operator or automatically by the system as useful external context. For example, a graph of the value of a particular stock, or various articles about it, might be very pertinent to a securities violation matter.

Communication documents 510 and communication events 570 are identified by the system according to their file type at the time the corpus is assembled. This is done both on the basis of file extensions, and by examining a small portion of the file (in one embodiment, the first 12 bytes) in order to confirm that the file format and extension are consistent with one another. In the event that they are not, or that the file extension is unknown to the system, the format will take precedence, and the system will parse the file looking for “hints” as to exact type. For example, the presence of e-mail headers would indicate that the file is, or at least contains, an e-mail. Documents 505 lacking such hints are just considered to be regular documents 515. Structure documents 530 are determined similarly; in addition to recognizing the file and data storage formats of common applications such as PeopleSoft, the system contains templates to recognize employee data. Logging documents 520 are recognized similarly; while the extension is less likely to be recognized, the series of records with timestamps provides a sufficient type identification mechanism.
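The extension-versus-format check described above can be sketched as follows. This is only an illustration under stated assumptions: the table `MAGIC_BY_EXTENSION`, the hint list `EMAIL_HEADER_HINTS`, and the function `classify` are hypothetical, and the three file signatures shown are well-known examples rather than a list taken from the specification.

```python
import os

# Hypothetical lookup table: file extension -> expected leading bytes.
MAGIC_BY_EXTENSION = {
    ".pdf": b"%PDF",
    ".png": b"\x89PNG",
    ".zip": b"PK\x03\x04",
}

# Header names whose presence hints that a file is, or contains, an e-mail.
EMAIL_HEADER_HINTS = (b"From:", b"To:", b"Subject:", b"Message-ID:")


def classify(filename: str, content: bytes) -> str:
    """Return a coarse type label for a file.

    The first 12 bytes are compared with the signature expected for the
    extension. If they disagree, or the extension is unknown, the format
    takes precedence and the content is scanned for hints.
    """
    ext = os.path.splitext(filename)[1].lower()
    head = content[:12]

    expected = MAGIC_BY_EXTENSION.get(ext)
    if expected is not None and head.startswith(expected):
        return ext.lstrip(".")

    # Extension unknown or inconsistent: trust the detected format instead.
    for known_ext, magic in MAGIC_BY_EXTENSION.items():
        if head.startswith(magic):
            return known_ext.lstrip(".")

    # No binary signature matched: look for e-mail header hints.
    for line in content[:2048].splitlines():
        if line.startswith(EMAIL_HEADER_HINTS):
            return "email"

    # Documents lacking any hint are treated as regular documents.
    return "regular"
```

For instance, a file named `picture.pdf` whose content begins with the PNG signature is labeled by its detected format rather than by its extension.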

E-mail is treated specially by the system during the spidering and type identification process. FIG. 6 a is a flowchart of one embodiment of the e-mail identity extraction process. Many common e-mail applications, such as Microsoft's Outlook, store arbitrarily large numbers of individual e-mails in a single file. During this process, these individual mails are extracted (Block 605) into individual documents, and placed in the index (Block 610). At this time, the meta-data associated with each e-mail is extracted (Block 615) and placed in separate fields. This includes the “from:”, “to:”, “cc:”, “bcc:”, time stamp, and subject, but also additional information when available, such as “reply to:” information. Note that where the system is unable to parse a text-format mail box (for example, due to file corruption), it exhibits graceful failure by treating the mail box as a text file.

When the document in question is an e-mail message, the message should already have been assigned a unique message ID by the e-mail software that generated it. However, in practice, many e-mail clients do not add a message ID, relying on the SMTP relay to do so instead. This often leads to e-mail stores containing messages that have no message ID. For example, a user's outbox folder (containing items sent by that user) contains messages that have not arrived via SMTP. If such a message is detected (Block 620), the system generates a unique identifier (Block 625). This can be done using any scheme that is guaranteed to generate unique IDs.
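As an illustration of blocks 620 and 625, the sketch below returns the existing Message-ID when present and otherwise generates one. The use of random UUIDs and the `generated.invalid` domain are assumptions made for this example; the specification leaves the scheme open, and a random UUID is unique only with overwhelming probability rather than by strict guarantee.

```python
import uuid
from email import message_from_string


def ensure_message_id(raw_message: str) -> str:
    """Return the message's Message-ID, generating one if it is missing."""
    msg = message_from_string(raw_message)
    existing = msg.get("Message-ID")
    if existing:
        return existing.strip()
    # Hypothetical scheme: a random UUID wrapped in Message-ID syntax.
    return f"<{uuid.uuid4()}@generated.invalid>"


with_id = "Message-ID: <abc@example.com>\nFrom: a@example.com\n\nbody"
without_id = "From: a@example.com\nTo: b@example.com\n\nbody"

kept_id = ensure_message_id(with_id)
new_id = ensure_message_id(without_id)
```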

In many cases, the same file will be found within several e-mail stores (if e-mail store files from several computers are used as input to the system). However, only one instance of each e-mail will be stored in the repository. Additional copies that are found (Block 630) will result in a new location reference being added to the record for that e-mail, and the “instances” counter for that mail will be incremented accordingly. In one embodiment, the system determines that the e-mail is a duplicate by doing a hash on the combination of header information and content. In one embodiment, the hash is included in the communication UID. If the message is a duplicate, it is not stored, and the process continues to block 605, to process the next message.
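The duplicate handling of block 630 might look like the following sketch. The class names, the choice of SHA-256, and the way headers and body are concatenated are all assumptions for illustration; the specification only states that a hash over header information and content is used.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EmailRecord:
    """One stored e-mail plus every location where a copy was found."""
    content_hash: str
    locations: List[str] = field(default_factory=list)
    instances: int = 0


class EmailRepository:
    """Hypothetical repository that stores each distinct e-mail once."""

    def __init__(self) -> None:
        self.records: Dict[str, EmailRecord] = {}

    def add(self, headers: str, body: str, location: str) -> bool:
        """Record one found copy; return True only if it was not seen before."""
        payload = (headers + "\n\n" + body).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()  # assumed hash function
        record = self.records.get(digest)
        is_new = record is None
        if record is None:
            record = EmailRecord(content_hash=digest)
            self.records[digest] = record
        # Every copy, new or duplicate, adds a location and bumps the counter.
        record.locations.append(location)
        record.instances += 1
        return is_new


repo = EmailRepository()
first = repo.add("From: fred@example.com", "Lunch?", "pc1/inbox")
second = repo.add("From: fred@example.com", "Lunch?", "pc2/inbox")
```

Keeping the location list alongside the single stored copy preserves where each duplicate was found.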

The standard Message-ID field is stored in the repository, as are other e-mail headers. For example, a combination of the Message-ID header (if present), the References header and the In-Reply-To header are stored in the repository (Block 635) and will later be used to construct edges between the vertices of the communications graph. The process determines whether there are any more email messages to process (Block 640), and if there are, the process returns to block 605. If no more messages remain, the process ends (Block 645).
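How the stored headers could later yield graph edges is sketched below using Python's standard `email` package. The function name `reply_edges` and the representation of an edge as a `(parent_id, child_id)` pair are assumptions for illustration, not the data model of the specification.

```python
from email import message_from_string
from typing import Iterable, Set, Tuple


def reply_edges(raw_messages: Iterable[str]) -> Set[Tuple[str, str]]:
    """Build (parent_id, child_id) pairs from threading headers."""
    edges: Set[Tuple[str, str]] = set()
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = (msg.get("Message-ID") or "").strip()
        if not msg_id:
            continue  # a generated identifier would be needed here
        # Both headers hold whitespace-separated message identifiers.
        parents = (msg.get("In-Reply-To") or "").split()
        parents += (msg.get("References") or "").split()
        for parent in parents:
            edges.add((parent, msg_id))
    return edges


messages = [
    "Message-ID: <1@x>\nFrom: a@x\n\nhello",
    "Message-ID: <2@x>\nIn-Reply-To: <1@x>\nFrom: b@x\n\nhi",
    "Message-ID: <3@x>\nIn-Reply-To: <2@x>\nReferences: <1@x> <2@x>\n\nre",
]
edges = reply_edges(messages)
```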

Another email-specific issue involves email agent garbling of actor names. FIG. 6 b is a flowchart of one embodiment of email identity disambiguation. For each new e-mail message, the system computes statistics (Block 670) about actor-related word n-grams in the e-mail headers (for example, From, Reply-To, To, Cc, Bcc). This technique is especially powerful when parsing ambiguous headers (Block 675), as in the following example of a message sent to two actors in a corpus containing the two actors Fred Smith and Ann Jones:

To: Fred, “Smith;”, “Smith;”, Ann, “Jones;”, “Jones;”

Statistical analysis (Block 680) of n-grams in this and other messages will reveal that the sequences (ignoring punctuation) “fred smith” and “ann jones” occur far more often than “smith ann” or “jones jones”. This allows the system to interpret the following header (in a format commonly used, for example, in some versions of the Microsoft Outlook™ e-mail client) as referring to a single recipient, even though internet RFC 822 states that the comma indicates two recipients rather than one, for example: To: Fred, Smith.
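The n-gram idea can be illustrated with a deliberately simple sketch that counts word bigrams across header values and then merges adjacent tokens whose bigram is frequent. The greedy merge, the `min_count` threshold, and all function names are assumptions for this example; the specification does not state how the statistics are applied, and this sketch does not attempt to handle the heavily garbled header shown earlier.

```python
import re
from collections import Counter
from typing import Iterable, List


def tokens(header_value: str) -> List[str]:
    """Lower-case alphabetic tokens, ignoring punctuation."""
    return re.findall(r"[a-z]+", header_value.lower())


def bigram_counts(header_values: Iterable[str]) -> Counter:
    """Count adjacent token pairs over all observed header values."""
    counts: Counter = Counter()
    for value in header_values:
        toks = tokens(value)
        counts.update(zip(toks, toks[1:]))
    return counts


def group_recipients(header_value: str, counts: Counter,
                     min_count: int = 2) -> List[str]:
    """Greedily merge adjacent tokens whose bigram is frequent enough."""
    toks = tokens(header_value)
    names: List[str] = []
    i = 0
    while i < len(toks):
        if i + 1 < len(toks) and counts[(toks[i], toks[i + 1])] >= min_count:
            names.append(f"{toks[i]} {toks[i + 1]}")
            i += 2
        else:
            names.append(toks[i])
            i += 1
    return names


observed = ["Fred Smith, Ann Jones", "Ann Jones", "Fred Smith"]
counts = bigram_counts(observed)
single = group_recipients("Fred, Smith", counts)
```

With these observations, “fred smith” and “ann jones” each occur twice while “smith ann” occurs once, so only the first two pairs are merged into single recipients.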

Returning to FIG. 5, in some instances the differentiation of whether a document 505 is a versioned document 555 or lifecycle document 565 cannot be determined initially, and as described in a subsequent section, this must be determined at discussion building time.

Meta-data documents 560 can in general only be identified or provided by the system's user. The exception to this is the case of a date-limited corpus to which more recent meta-data documents 560 are added. In this event, the meta-data documents 560 can be identified by having a date stamp that is outside the valid set of dates for the corpus. Meta-data documents 560 do not need to have authors or titles. In fact their only required attribute is non-null textual content, although in order to be useful, a meta-data document 560 should contain information about the information in the corpus or its actors. As noted above, a deposition involving data in a corpus is an example of a meta-data document 560. If meta-data documents 560 are added to the corpus after the initial indexing, a separate process will be kicked off to index them.
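The date-based exception can be expressed as a simple range test. In this sketch the valid set of dates is assumed to be one contiguous interval, and the function name and sample dates are hypothetical.

```python
from datetime import date


def outside_corpus_range(stamp: date, corpus_start: date, corpus_end: date) -> bool:
    """True if a document's date stamp falls outside the corpus date range."""
    return stamp < corpus_start or stamp > corpus_end


# Hypothetical date-limited corpus covering the years 1999 and 2000.
start, end = date(1999, 1, 1), date(2000, 12, 31)
deposition_flagged = outside_corpus_range(date(2002, 5, 14), start, end)
memo_flagged = outside_corpus_range(date(1999, 6, 30), start, end)
```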

Document 505 de-duplication is done during the indexing according to currently practiced methods. However, the data store containing the index also contains a complete set of pointers to the original locations of all of the duplicates. This information will be used later. This leaves us at this stage with a de-duplicated set of documents 505 which have been typed by the system according to the types described above (with the exception of those types of documents 505 that may be added to the corpus after the fact). Note that any document 505 that the system cannot classify will be placed in an “unknown” category.

These documents 505 can be reviewed by the user, who has the choice of categorizing them individually or by type (for example, adding a new file extension to those recognized by the system). In addition, the user may add a regular expression specifying valid combinations of the first N characters of the file. Generally, however, documents 505 that do not fall into any of the above categories can be considered exogenous. That is, they were neither created nor meaningfully modified by any of the actors related to the corpus of documents 505 being analyzed, nor added to the corpus because of their relevant content. Examples of such exogenous documents 505 include, but are not limited to, operating system and software application files.

In order to proceed with the next step in analyzing the documents 505, the system generates a set of actor profile information. Since much of the interesting information about actors involves their relationships to other actors, this information is represented in a graph. In one embodiment, the graph representation employed by the system is a variant of a colored graph as currently practiced in computer science (see discussion building section below for a more detailed description). Actors are implemented as A-type vertices. Different types and strengths of relationships among the actors are implemented as appropriately colored edges. The actor relationships form a sub-graph in the colored graph. To give a precise definition, an actor is a person or computer system that is referred to in communications from the corpus or that produces, sends or receives communications in the corpus.
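A minimal data-structure sketch of such a graph is given below. Only the “A” vertex type for actors comes from the text; the class names, the “D” vertex type, the edge colors, and the numeric weights are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    color: str     # relationship type, e.g. "email" (assumed label)
    weight: float  # relationship strength (assumed scale)


class ColoredGraph:
    """Typed vertices joined by colored, weighted edges."""

    def __init__(self) -> None:
        self.vertex_type: Dict[str, str] = {}
        self.edges: List[Edge] = []

    def add_vertex(self, vertex_id: str, vertex_type: str) -> None:
        self.vertex_type[vertex_id] = vertex_type

    def add_edge(self, source: str, target: str, color: str,
                 weight: float = 1.0) -> None:
        if source not in self.vertex_type or target not in self.vertex_type:
            raise KeyError("both endpoints must be added as vertices first")
        self.edges.append(Edge(source, target, color, weight))

    def actor_subgraph(self) -> List[Edge]:
        """Edges whose endpoints are both A-type (actor) vertices."""
        return [
            e for e in self.edges
            if self.vertex_type[e.source] == "A"
            and self.vertex_type[e.target] == "A"
        ]


graph = ColoredGraph()
graph.add_vertex("fred", "A")
graph.add_vertex("ann", "A")
graph.add_vertex("doc-1", "D")  # hypothetical non-actor vertex type
graph.add_edge("fred", "ann", color="email", weight=3.0)
graph.add_edge("fred", "doc-1", color="authored")
actor_edges = graph.actor_subgraph()
```

Filtering on vertex type is one simple way to obtain the actor sub-graph that the text describes.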

Because the system is primarily concerned with determining probable causality between objects in the corpus, the object class of greatest interest is the actor. The presence of the same actors (or actors closely related to them) can be used to assess causality when there is otherwise insufficient data. Many people may use similar language to discuss similar things, but there is no cause-effect relationship involved in these communications apart from those instances where the actors are shared. Similarly, the identity of the actors who authored or modified a document is a key clue in determining whether two documents have an ancestral relation to one another, or whether they are just two similar-looking documents created independently at different times. Thus the actors' identity should be accurately assessed before any further analysis on the documents is performed.

Therefore, the system starts with correctly identifying the various electronic identities that correspond to a human user, as well as other electronic records that identify him or her, such as those in a human resources management system. It is not at all uncommon for there to be a very large number of different references to the same individual in a corpus obtained from a corporation or other organization. For example:

-   Official work email account
-   Aliases to this account
-   Personal (non-work-related) email accounts
-   IM monikers
-   Logins to or accounts for different systems, such as document repositories or HR systems
-   Name, or sequentially used names, as recorded in an HR system
-   Name as recorded in a license for an application such as Microsoft Word; this is the name that will be used in the “author” meta-data field by the application.
-   Name as recorded in numerous forms filled out over time for different reasons, for example purchase orders or expense reports.
-   Name as recorded in form-fillers, electronic wallets, or similar items

While commercial systems exist to normalize slightly differing variations of people's names (and one embodiment of the present invention uses these existing systems), this normalization is only a small part of the much larger problem of uniting as many of these different electronic identities as appropriate into the same actor ID. Many of these identities may have no lexical similarity whatsoever to one another. For example, the actor “John Smith” may have an email account bluebear@yahoo.com. Such existing systems will not ferret out this information unless it appears in the same record as the actor name. Often this will not be the case, especially as many people (perhaps even most) have at least two unrelated email accounts.

FIG. 8 is a flowchart of one embodiment of the actor graph construction. In order for the invention to perform this unification of different electronic identities and references into actor IDs, it starts with an analysis of all available email stores. Recall that email instances were identified as a distinct type during the initial indexing phase, and, where appropriate, individual emails were extracted from any archival formats, and their meta-data placed into separate fields for subsequent analysis. This analysis is now performed. The system reads the record corresponding to each email in the data store (Block 810). Using the meta-data information in the actor-related fields (to:, cc:, bcc:, and from:), the system constructs a sub-graph of electronic identity relationships.

In one embodiment, each electronic identity is a first pass alias record (vertex type AP1.AR, for color AP1, type AR), and each edge represents co-involvement in the same email communication. Two different edge colors are defined for this subgraph: FT (from-to) on directed edges and TT (to-to) on undirected edges. For example, if John Smith sends an email to Jane Jones and Jack Barnes, there would be an FT-colored edge from John Smith to Jane Jones and another from John Smith to Jack Barnes, and a TT-colored edge between Jane Jones and Jack Barnes.

For each relationship, if an edge linking the two electronic identities does not yet exist, it is generated; if one already exists, its weight is augmented by one (Block 820). If there is more than one recipient (Block 830), appropriate to-to links are generated between recipient pairs, or the weight of the existing links is augmented by one (Block 840). This process continues until there are no further records to process (Block 850).
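The FT/TT edge accumulation described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the record layout (a dict with `from`, `to`, `cc` and `bcc` keys) and the function name `build_identity_graph` are assumptions made for the example, and a newly generated edge is assumed to start with weight one.

```python
from collections import Counter
from itertools import combinations


def build_identity_graph(emails):
    """Accumulate FT (directed) and TT (undirected) edge weights.

    `emails` is a list of dicts with a 'from' address and optional
    'to', 'cc' and 'bcc' lists (hypothetical field names).
    """
    ft = Counter()  # (sender, recipient) -> weight
    tt = Counter()  # frozenset({recipient_a, recipient_b}) -> weight
    for mail in emails:
        sender = mail["from"]
        recipients = []
        for field in ("to", "cc", "bcc"):
            for addr in mail.get(field, []):
                if addr not in recipients:  # count each recipient once per mail
                    recipients.append(addr)
        # One from-to edge per recipient; an existing edge gains weight one.
        for rcpt in recipients:
            ft[(sender, rcpt)] += 1
        # One to-to edge per unordered recipient pair.
        for a, b in combinations(recipients, 2):
            tt[frozenset((a, b))] += 1
    return ft, tt
```

With the John Smith example above, a single mail to Jane Jones and Jack Barnes yields two FT edges of weight one and one TT edge of weight one; a second mail to Jane Jones alone raises only the John-to-Jane FT weight.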

At this point, these are still electronic identities, not actors. Therefore, in one embodiment, the process of building this graph will be repeated at a later stage with actor IDs, rather than their various electronic identities. The next step is to unify these electronic identities associated with email accounts into actor IDs. As noted previously, a 1:1 correspondence between actors and electronic identities is almost never expected. In reality, most actors will have many electronic identities. These different identities are also referred to as “alias terms.” Further, some electronic identities may be used by more than one person. (This latter problem will be dealt with at a later stage in the processing, and is presumed to be far less commonly occurring and important than the issue of single actors having many electronic identities.)

Next, a second sub-graph is constructed using the technique described in the following paragraphs. This sub-graph uses AP2.AR vertices and undirected CA-colored edges. FIG. 9 a is a flowchart of one embodiment of this phase of the actor graph construction.

In order to optimize the speed of edge creation in the graph, the system creates auxiliary inverted indexes on any kind of document that appears to contain both human names and email addresses in close proximity to one another, e.g. “alias records” (Block 902).

FIG. 9 d is a flowchart of one embodiment of deriving alias records. To split the name side of an email into candidate names and addresses, the system looks for common separators such as dot, dash, and underscore. Other methods include permutations of first name/last name/middle name/company name, and middle initial insertion/removal (Block 972).
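As an illustration of the separator-based splitting and name permutation just described, the following sketch derives candidate human names from the part of an address before the “@”. The function name `candidate_names` and the cap on the number of parts are assumptions for the example; middle-initial insertion/removal and company-name handling are not attempted here.

```python
import re
from itertools import permutations


def candidate_names(address, max_parts=4):
    """Return candidate 'First Last' style names for an email address.

    The local part is split on dot, dash and underscore, and every
    ordering of the resulting parts is offered as a candidate.
    """
    local = address.split("@", 1)[0]
    parts = [p for p in re.split(r"[._\-]+", local) if p]
    if len(parts) > max_parts:
        # Too many orderings to be useful; keep only the original order.
        return {" ".join(p.capitalize() for p in parts)}
    return {
        " ".join(p.capitalize() for p in order)
        for order in permutations(parts)
    }
```

For `john.smith@acme.com` this produces the candidates “John Smith” and “Smith John”, while an address such as `bluebear@yahoo.com` produces only “Bluebear”, which is why lexical evidence alone cannot link it to the actor “John Smith”.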

In the event that there is an explicit alias mapping file (Block 974) anywhere within the corpus (any document containing data of the form “Human name, email account 1, email account 2 . . . email account N”), this information will be automatically considered as evidence that each of these email accounts maps to the actor specified at the beginning of the record. An exception to this is the case in which an email account appears to be associated with more than one actor. In this event, it is treated by the system as an aggregate actor. This is one of the types of “structure documents” that the system looks to identify during the initial data acquisition. For one embodiment, the system creates CA-colored edges (Block 904) for each such pairing.

If one of the types of structure document available is an HR database, the system looks for a “maiden name”, “previously used name,” “née” or similar fields (Block 976). In those records where these fields are non-null, a new first name last name combination is created, and is added as a second name to the existing AP2.AR vertex.

As noted in FIG. 9 a, the system creates edges and assigns weights to those edges (Block 904). FIG. 9 b is a flowchart of one embodiment of edge creation. In order to ensure that there is a CA-colored edge between all alias record pairs (that is, pairs of AP2.AR vertices such as the email address and the actor name), the system creates an edge between each alias record pair (Block 920) and initializes its weight to 0 if there is currently no edge between them. For each pair of vertices, the product of the two alias term similarities is added to the current edge weight.

The system creates a CA-colored edge between AP2.AR vertices if a pair of corresponding alias records can be constructed from an AP1.AR vertex on the basis of any of the following (note that some embodiments of the invention may weight these different types of evidence differently):

-   Permutations of first name last name with same domain connected by a to/cc or to/bcc link (from the previously constructed graph) (Block 922).
-   The same characters prior to the “@”, but a different domain (Block 924).
-   Co-occurrence of either of the above, and a from/to link from an account with the same characters prior to the “@” but with different domains (Block 926).
-   Co-occurrence of n-grams in nodes linked directly together in the previous graph (Block 928).

In one embodiment, the presence of FT or TT colored edges between two communication documents is considered negative evidence for linking their associated alias records, or a negatively weighted link. This is because a person rarely sends replies to themselves.

In one embodiment, this analysis is repeated at a later stage in the process when additional evidence is available. In another embodiment of the invention, this analysis is conducted without benefit of the previously constructed graph, for reasons of computational efficiency. In these embodiments, only the lexical part of the test is performed.

Note that in order to protect against or correct mistakes, a set of either abstract or specific constraints may optionally be consulted in the edge construction process before merging two records.

AP2.AR vertices are decorated with a weight. In one embodiment of the invention, the weight is the number of occurrences of the vertex's alias term in the corpus. Weights may be increased if the alias term corresponds to a human actor name using a well-known convention (Block 906), for example “John Smith” mapping to john.smith@acme.com.

In one embodiment of the invention, the following bottom-up clustering procedure (Block 908) is then applied. One embodiment of this process is described in more detail by FIG. 9 c.

The CA-colored edges are sorted by decreasing weight (Block 940). The heaviest edge is selected (Block 942), and its head and tail are merged (Block 944). This operation represents unification of the two aliases. The alias term similarities are recomputed in the merged vertex as the sum of the two corresponding similarities (or zero if only present in one vertex) in the head and tail vertices (Block 946). Using the same procedure as for edge creation, the process adjusts the weights of the edges in order to account for any vertices that were merged in the previous step (Block 948). The edges are then re-sorted by decreasing weight (Block 950). This process is repeated until the heaviest edge falls under a predefined similarity threshold (Block 952). The remaining alias vertices form the set of all de-duplicated alias terms. For each such alias term, an actor vertex (node type A) is created; another graph of relations between these vertices will be built after discussions are created.
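The heaviest-edge merging loop described above can be sketched as follows. This is an illustrative simplification under stated assumptions: each vertex is represented as a hypothetical dict mapping alias terms to similarities, the edge weight is taken to be the sum over shared terms of the product of the two similarities, and every pair is rescanned on each pass. The sketch is therefore quadratic per pass and does not use the inverted indexes by which the system described here keeps the process subquadratic.

```python
def edge_weight(a, b):
    """Sum over shared alias terms of the product of the two similarities."""
    return sum(sim * b[term] for term, sim in a.items() if term in b)


def cluster_aliases(vertices, threshold):
    """Merge alias vertices bottom-up until the heaviest edge is too light.

    `vertices` maps an alias name to a dict of term -> similarity
    (a hypothetical representation chosen for this example).
    Returns a dict mapping frozenset-of-aliases -> merged similarities.
    """
    clusters = {frozenset([name]): dict(terms) for name, terms in vertices.items()}
    while True:
        best = None
        keys = list(clusters)
        for i, u in enumerate(keys):
            for v in keys[i + 1:]:
                w = edge_weight(clusters[u], clusters[v])
                if best is None or w > best[0]:
                    best = (w, u, v)
        # Stop when no edge remains or the heaviest falls under the threshold.
        if best is None or best[0] < threshold:
            return clusters
        _, u, v = best
        merged = dict(clusters.pop(u))
        for term, sim in clusters.pop(v).items():
            # Similarities add; a term missing from one side contributes zero.
            merged[term] = merged.get(term, 0.0) + sim
        clusters[u | v] = merged
```

Each cluster remaining at termination corresponds to one de-duplicated alias term, for which an actor vertex would then be created.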

Other embodiments of the invention may be implemented with different clustering algorithms. In one embodiment, the “pick the heaviest edge” step can be vetoed using an algorithm specified by the user. In case of veto, the edge is simply skipped. If all edges are vetoed, the clustering algorithm terminates. Similarly, in the case of an alias mapping file that is certified by the user as having error-free information, this file can be fed into the system to ensure both that the email aliases specified in the file are bound to an actor with the correct first name, last name combination, and that they are bound to no other actor. Note that this process is subquadratic in the total number of aliases.

In one embodiment, this sub-graph is updated again after the document lineage has been computed in the next step in the processing.

In one embodiment, the merging process is fully auditable. During a manual review, users of the system can observe and override an alias term merging decision or specify different merges. Similarly, the user can add constraints (such as ignoring input from certain files or sources) and cause the graph building to be re-run with the constraints observed. Note that because of the sparsity of the graph, the overall process is subquadratic in the number of vertices.

In another embodiment, there are additional methods used in conjunction with the above in order to disambiguate actors, including analyzing the signature footer of messages (such as vcards, any signature information stored in an email client, or any text apparently containing such information as actor name, telephone number, and email address), and using existing techniques to identify the prose style of the different authors. These techniques include, but are not limited to, such details as preferred syntactical structures, lexical items, and characteristic misspellings and typographical errors. These techniques are particularly useful in a situation where Actor A sends a message to Actor B, and Actor B replies as Actor C. It is important for the system to be able to determine that Actors B and C are the same individual.

This careful attention to alias merging allows the system to keep track not only of all messages sent or received by a particular actor, but also of instances in which an actor is removed from an e-mail thread in mid-discussion. If an actor can be determined not to have participated in the crucial portion of a discussion during which a particular decision was taken, that individual can potentially be shown not to be responsible for the decision in question. It therefore becomes much easier to “prove the negative”, a notoriously difficult goal to achieve. To these ends, the present invention constructs a list of aliases that resolve to a particular actor, and consults this list whenever an actor needs to be identified.

Note that the system begins this analysis with email because it is verifiably correct; if email comes from a particular address, it is proof that the address existed. That an email was sent to an address does not make that address verifiably correct. While one embodiment of the invention matches up email bounces to the original mail (using reply-to and other information stored in the headers), hence invalidating the alias term, this will not solve the problem where the sending actor has erred in the email address but the address nevertheless still resolves, albeit to an unintended actor. This is why different techniques, described below, are used in order to eliminate such very low frequency actors.

Files identified automatically by the system as being from a known HR system, which match one of the system templates for HR information, or which are manually identified by the user as being HR records are considered to be accurate by the system, but not necessarily complete. For example, there may be discrepancies between the names found in such systems and the names people actually use on a daily basis. In essence, in such cases the actor in question has two concurrently valid names. Other types of actor database information are similarly treated; for example, CRM and SFA systems are considered to provide reliable information about customers and sales contacts.

This initially constructed information can be both verified and augmented by assigning ownership to different machines or media in the corpus. FIG. 10 a is a block diagram of one embodiment of ownership and authorship determination. The system does this by a combination of reading system user information (Block 1002) on the machine, seeking email client information (Block 1004) contained on the machine specifying one or more email addresses to use as sender information, and by examining the author meta-data (Block 1006) of the files resident on the machine (or media). The system attempts to extract author information based on the above data sources (Block 1008).

If there are no documents created in a particular application, such as Microsoft Word, whose author meta-data corresponds to the actor-owner of the machine, or if only a joke or meaningless ID is found (Block 1010), the author meta-data information from this application on this particular machine will be disregarded by the system as being inaccurate (Block 1014). In addition, the system has a list of meaningless placeholder actors that will also be disregarded (Block 1014), for example “A user.” Further, common joke names like “Bart Simpson” will be similarly disregarded (Block 1014) via the use of a look-up table, unless there is already a node in the actor graph with this name (e.g. one that is demonstrably responsible for the generation of communications). In either of these cases, authorship will be established (Block 1026) in a later step by a combination (Block 1022) of email trail, check-in history, and presence on the machines “owned” by other actors, or, failing any of these (because only one instance of the document was ever found), the authorship will be assigned (Block 1028) to the owner of the machine.

If the ID found is not a joke or meaningless ID, the process determines whether there is conflicting information from various data sources (Block 1012). If there is no conflicting information, e.g. all sources agree on the actor's identity, the process assigns authorship based on the evidence found (Block 1026). If there is conflicting data, the system determines whether email client information is available (Block 1018). If no such information is available, email trails, check-in history, and presence of other items owned by other actors are used (Block 1022). The process then continues to Block 1024.

Email information is considered to be the most reliable for assigning ownership, and thus more importance is assigned to email client information (Block 1020), if any is found. Thus, email client information, if available, will take precedence when there is a discrepancy. In the unusual case that there is more than one such email store on the same machine or media, corresponding to different well-known actors, the system selects the obviously larger of the two. If the stores are approximately the same size, but the email stores' contents do not overlap in time, the machine is considered to have been passed from one actor to another. This date range information will be kept, and used to disambiguate ambiguous document ownership at a later stage. If the date ranges of the email stores do overlap, machine ownership information can only be used to disambiguate between the actors associated with the two or more email stores.

If evidence of ownership is found (Block 1024), authorship is assigned based on the evidence found (Block 1026); otherwise, authorship is assigned to the owner of the computer (Block 1028).

In the case of logins to different systems, the system can process the user information from these systems. FIG. 10 b is a flowchart of one embodiment of processing user information in the case of multiple system logins. If there is a linkage to an actor name or email address (Block 1050), the system will add this information to the actor profile. If no such information is found, the user adjudicates (Block 1058). If such information is found, but the name is ambiguous (Block 1052), the system will leverage the identities of any unambiguous actors to resolve (Block 1054) the ambiguous name to someone in the same department. In one embodiment, this is done by moving upwards one step hierarchically, and sideways to peer groups each iteration, or failing that, membership in the same circle of trust as the unambiguous actors. If the name remains ambiguous (Block 1056), because either two actors with the same name appear at the same level, or because none are found, a user will have to adjudicate (Block 1058). Additionally, the user of the system may manually link an account to an actor, in those cases where the system lacks the evidence to do it. Similarly, any erroneous linkages may be corrected manually.

Other types of files may be entered into the system by the user, but may require the user to specify which fields (or otherwise delimited data) correspond to which actor properties.

FIG. 11 is a flowchart of one embodiment of identifying spam and other exogenous content. In general the following should all be true:

-   Each actor represented in an organizational chart should correspond to one entry in the HR system.
-   Each actor represented in the HR system should have at least one electronic identity. This is expected to be true, except in very rare outlier cases.
-   Some, but probably not all, of the actors noted in personal address books, and systems managing information for such actors as vendors, suppliers, customers, partners, prospects, etc., will have at least one other manifestation in the corpus, for example sending an email.

These actors, as well as any aggregate aliases comprised of them, are considered to be “indigenous” to the corpus (Block 1115).

Any actor that is represented in the HR database (Block 1105), organizational chart, personal address books (Block 1120), or system managing non-employee actor information (Block 1110), or has received email from an actor indigenous to the corpus (Block 1125) is considered to be indigenous (Block 1115).

An actor that does not meet any of the above criteria, or that has sent email to one or more actors none of whom have replied, is considered to be either spam, or some kind of exogenous creator (Block 1130) or distributor of data, for example on-line newsletters. All communications from these actors may optionally be removed (Block 1135) from the data to be processed into discussions. The process then proceeds to the next actor (Block 1140). For one embodiment, this process is iterative. For one embodiment, the process may be repeated with a second pass once the majority of indigenous actors have been identified.
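The iterative indigenous/exogenous split can be sketched as below. This is a rough illustration under assumptions: `listed` stands for the union of actors found in the HR database, organizational chart, address books and non-employee systems, `sent` is a hypothetical list of (sender, recipient) pairs, and the “no one has replied” test is approximated by the rule that an unlisted actor becomes indigenous only after receiving mail from an indigenous actor.

```python
def classify_actors(actors, listed, sent):
    """Split actors into (indigenous, exogenous) sets.

    actors: all actor IDs seen in the corpus
    listed: actors present in HR, org chart, address books or similar
    sent:   iterable of (sender, recipient) pairs taken from email
    """
    actors = set(actors)
    sent = list(sent)  # re-iterated on every pass
    indigenous = set(listed) & actors
    changed = True
    while changed:  # repeat until no further actor can be promoted
        changed = False
        for sender, recipient in sent:
            if (sender in indigenous and recipient in actors
                    and recipient not in indigenous):
                indigenous.add(recipient)
                changed = True
    return indigenous, actors - indigenous
```

An address that only ever sends mail and never receives any from an indigenous actor, such as a bulk mailer, remains in the exogenous set, and its communications may then optionally be removed before discussions are built.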

An alternate embodiment of the invention can also employ current techniques for identifying spam, including but not limited to: Bayesian analysis on content, detecting random characters at the end of the message title, and lookup of known domains emitting spam. Such embodiments allow the user to specify how many tests must be positive before the item is automatically removed from the data set.

Actor Attributes (Actor Profile)

FIG. 12 illustrates one embodiment of actor class and containment hierarchy. Individual human actors 310 have the following properties in the invention:

-   UID
-   First Name
-   Other first name
-   Middle Initial
-   Nickname
-   Last Name
-   Last Name 2 (maiden name, other previously used name, or shortened form of name used in computer systems)
-   Organization
-   Department
-   Job Title
-   Alternate Job Title
-   Primary Work Email
-   Primary Personal Email
-   Other email accounts (as needed)
-   IM moniker
-   Account login information record (as needed)
-   Primary language (that is, spoken language)
-   Other languages (as needed)
-   Primary register
-   Importance
-   Personalities 1220 (as needed)
-   Calendar
-   Task lists
-   Work phone
-   Home phone
-   Mobile phone
-   Lexical fingerprint (or footprint). This is created by comparing a word frequency table for the whole corpus (the inverted index) and that for each individual actor 1210, omitting the 100 most commonly used non-stop words from the entire corpus. Any words which are found only for a given actor, or in a significantly higher distribution than in the rest of the corpus, become part of that actor's lexical footprint. For example: if ‘godzilla’ occurs 100 times in the whole corpus, its frequency is going to be 0.00x%, but if actor A shows 32 instances of ‘godzilla’ in his communications, his frequency is going to be more like 0.0x%, which will be a significant enough distinction in usage to cause it to be added to his footprint.
-   Circles of Trust (as needed)
-   Communication behavior records (as many as are needed). These are of the format:
-   Actor-personality (because an actor 310 may have more than one “personality” 1220)
-   Version
-   Interaction counts
    -   1) From-count
    -   2) To-count
    -   3) Cc-count
    -   4) Bcc-count
    -   5) Was-cc'ed-count
    -   6) Was-bcc'ed-count
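The lexical fingerprint comparison in the list above can be sketched as follows. The function name, the `ratio` cut-off used to stand in for “significantly higher distribution”, and the assumption that stop words have already been removed from both frequency tables are all choices made for this example only; the actor's frequency is compared against the whole-corpus frequency for simplicity.

```python
from collections import Counter


def lexical_footprint(corpus_counts, actor_counts, skip_top=100, ratio=5.0):
    """Return the words that characterize one actor's usage.

    corpus_counts: Counter of word occurrences for the whole corpus
                   (assumed to include the actor's own words)
    actor_counts:  Counter of word occurrences for the actor
    skip_top:      number of most common corpus words to omit
    ratio:         hypothetical factor by which the actor's relative
                   frequency must exceed the corpus-wide one
    """
    common = {word for word, _ in corpus_counts.most_common(skip_top)}
    corpus_total = sum(corpus_counts.values())
    actor_total = sum(actor_counts.values())
    footprint = set()
    for word, n in actor_counts.items():
        if word in common:
            continue
        only_this_actor = corpus_counts[word] == n
        actor_freq = n / actor_total
        corpus_freq = corpus_counts[word] / corpus_total
        if only_this_actor or actor_freq >= ratio * corpus_freq:
            footprint.add(word)
    return footprint
```

In the ‘godzilla’ example, 32 of the 100 corpus-wide occurrences falling within one actor's comparatively small body of text gives a relative frequency far above the corpus-wide one, so the word enters that actor's footprint.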

The “interaction count” fields can be divided up by any time interval and replicated accordingly, depending on how the user configures it.

Note that many other additional actor 310 property fields may be added if available, such as demographic information (age, gender, marital status, race, etc.). If this data is available it may optionally be used to perform correlations and tallies. In fact, any arbitrary attribute may be added and used in this way. This information is not used in any way to build discussions 305. However, it can optionally be used to provide “hints” to the anomaly detection engine described in a later section. Such hints (for example, that membership in a particular minority group may have caused the actor 310 to be left out of certain communications) can be used to rank anomalies, query on the relevant anomalies, and determine whether these anomalies are statistically significant as a combined class.

Many of these fields may be null for any given actor 310. A valid actor 310 requires at least one electronic identity 1225. Other information may be calculated and/or filled in later, or remain permanently absent. However, in one embodiment, each time a new actor node is generated, a new default set of such fields is created in the data store. If, at a later stage in the process, it is determined that the same actor 310 exists more than once in the data store under different identities, the interaction count records are merged, as are other attributes that can have either 2 values, or a variable number of values. In the case of single-valued attributes, the user decides which values are to be retained.

Some actors 310 are aggregate actors 1205, for example a mail alias that corresponds to a department of people. In many corporations, such mail aliases are used very extensively, and so cannot be overlooked. These aggregate actors 1205 have the following properties:

-   Name
-   UID
-   Email
-   Login account records (as needed)
-   Department (may be null if the aggregate actor 1205 has no correspondence to a particular organization or group).
-   Actors 310 (as many fields as needed)
-   Presumed Version
-   Personalities 1220. This is determined by how many distinct actor lexical footprints or signatures pop up in communications sent from the aggregate actor 1205 identity.
-   Calendar; some aggregate actors 1205 may have calendars associated with them.
-   Login Account info: This is important information to have in order to trap all accounts that a particular actor had available to him, and hence may have used to avoid using his own identity to access certain data.
-   Creation date: If not known, the date of the first appearance of the identity is used.
-   Deletion date: Will often be null; else the date on which this aggregate identity was explicitly deleted.

Note that aggregate actors 1205 are ultimately expanded to their member list, although the data that there was the intermediary of an alias is maintained. However, this expansion is not always a trivial problem, since the members of an alias may change over time. FIG. 13 a is a flowchart of one embodiment of identifying alias versioning. Version information on alias membership is often not maintained. In the face of this difficulty, the system will attempt to reconstruct the history of the alias by executing the following, in order of precedence:

-   Using any structure document 530 that maps departments or organizations to mail aliases, the system will look up the HR records (Block 1305) as to who was in each department or organization that corresponded to an alias during what interval of time. If there is no way to do this in an automated fashion (Block bus 205), the system will allow the user to manually perform the mapping, or to specify a mapping rule.
-   Analyze the set of individual actor 1210 replies (Block 1315) to email sent by an aggregate actor 1205. If someone replies to such a mail, it means that he was on the alias list.
-   Next, add any actors 310 for whom an email client file (or server records) indicates that the person received mail (Block 1320) from this aggregate actor 1205. That is, the mail is still in an “inbox” or other mail folder belonging to this actor 310. Similarly, identify any actors 310 who had the alias in their personal address book. Add any new actors 310 found in this fashion to the list.
-   Look up each actor 310 in any HR data sources (Block 1325) that exist in the corpus to determine the set of their lifecycle events 545 (i.e. hire, termination, transfer, promotion) as well as which departments they belonged to. If all of the actors 310 belonged to the same lowest level department, add (Block 1330) any other actors 310 that were also in this department.
-   Finally, in order to create the versioning information, use the lifecycle information extracted above to determine when the actor 310 first entered or left the universe of the corpus, and where applicable, when they transferred in or out of a particular group. In another embodiment of the invention, the HR information is not considered, and only empirical evidence offered by saved emails is used. In one embodiment, the user may normalize department names during this process (Block 1335).
-   Bump the presumed version number of the alias by one for each distinct day on which a change occurred (Block 1340). This is to avoid counting a single mass change (for example, a group of 20 actors 310 being transferred to another division) 20 times.
-   Each version of an alias is represented by a separate vertex in the graph. In one embodiment, the version number is appended to the core actor 310 id.
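The per-day version bump in the steps above can be sketched as follows. The event format, a tuple of date, actor and a `"join"` or `"leave"` marker, is an assumption made for the example; in practice such events would be derived from the lifecycle information and email evidence described in the list.

```python
from datetime import date


def alias_versions(changes):
    """Build successive membership versions of one alias.

    changes: iterable of (day, actor, kind) with kind 'join' or 'leave'
             (a hypothetical event format).
    Returns a list of (version_number, day, frozenset_of_members), with
    one version per distinct day on which any change occurred.
    """
    by_day = {}
    for day, actor, kind in changes:
        by_day.setdefault(day, []).append((actor, kind))
    members = set()
    versions = []
    for version, day in enumerate(sorted(by_day), start=1):
        for actor, kind in by_day[day]:
            if kind == "join":
                members.add(actor)
            else:
                members.discard(actor)
        # All changes on the same day collapse into a single version.
        versions.append((version, day, frozenset(members)))
    return versions
```

Twenty transfers recorded on the same day therefore produce one new version rather than twenty, and each resulting tuple corresponds to one alias-version vertex in the graph.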

Note that alias versioning can sometimes be an important issue, because it can be used to establish that an actor 310 almost certainly received a certain mail sent to an aggregate alias, even if no versioning information for the alias exists, and the copy of the mail sent to that actor 310 has long since been destroyed.

Returning to FIG. 12, aggregate actors 1205 do appear in one representation of the actor 310 sub-graph. In one embodiment, a mail sent to an alias will generate a link (or an additional link weight) from the individual actor 1210 sending it, to the aggregate alias, and then N links (or additional link weights) to each of the members of the aggregate alias, where N is the number of such members. This is because the distinction between sending to an alias and to individual actors 1210 can be important under some circumstances, later in the process during the communications anomaly detection phase. For example, an actor 310 sending an email to nine individual actors 1210 when there is an aggregate alias that had been previously used by this actor 310 that contains these nine actors 310 as well as one other, could be used as evidence that the actor 310 was attempting to prevent the tenth actor 310 from seeing the message.
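The nine-of-ten situation described above suggests a simple check, sketched here under assumptions: the function name, the `max_omitted` parameter and the dict of previously used aliases are hypothetical, and a real anomaly detector would weigh this signal alongside other evidence rather than treat it as conclusive.

```python
def possibly_excluded(recipients, aliases_used_by_sender, max_omitted=1):
    """Flag aliases whose members were addressed individually, minus a few.

    recipients:             addresses on the message being examined
    aliases_used_by_sender: dict alias -> set of members, restricted to
                            aliases this sender has used before
    Returns a dict alias -> set of members left off the message.
    """
    recipients = set(recipients)
    flagged = {}
    for alias, members in aliases_used_by_sender.items():
        members = set(members)
        omitted = members - recipients
        addressed = members & recipients
        # Some members were addressed, and only a few were left out.
        if addressed and 0 < len(omitted) <= max_omitted:
            flagged[alias] = omitted
    return flagged
```

A message sent to the full membership returns no flags, since nothing was omitted.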

Some actors 310 may be computer processes, for example a daemon that automatically sends out status reports. While such non-human actors, or system actors, 1215 are generally not very interesting, for the sake of accounting, they cannot be ignored. Such non-human actors 1215 have the following properties:

-   Name (e.g. process name)
-   Host name
-   UID

FIG. 13 b is a flowchart of one embodiment of the actor parsing and deduplication process. While items remain in the alias-bearing containers in the corpus (Block 1350), the system picks an item (Block 1351), and checks whether it is parsable into distinct alias terms (Block 1352). If so, then, while alias terms remain in the item (Block 1353), it picks an alias term (Block 1354), splits it after the email address (Block 1355), adds occurrence and n-gram frequencies for the term components (Block 1356), corrects the frequencies (Block 1357) and adds a corresponding vertex if none exists (Block 1358). It then accrues the weighted term components with the alias vertex (Block 1359). If the item is not parsable into distinct alias terms (Block 1352), the system delays processing until the n-gram model is built (Block 1360), segments the item (Block 1361), and continues to extend the segment while the n-gram frequencies remain above the threshold (Block 1362). Separate indexes are built from keys (Block 1366). For each type of key (Block 1367), and for each pair of approximately equal keys (Block 1368), and of alias vertices containing the pair of keys (Block 1369), the system computes the first part of the arc weight with the head vertex (Block 1370). It similarly computes the second part of the arc weight with the tail vertex (Block 1371), and adds the product of the two weight parts to the weight of the arc (Block 1372). The arc weight computation completes in subquadratic time (Block 1373). Next, the system applies a hierarchical graph clustering with constraints (Block 1374). It picks the arc with the largest weight. If the alias pair is on the list of negative exceptions, it ignores that arc and picks the next one (Block 1375). If the arc weight is below the threshold (Block 1376), clustering is finished. The remaining clusters are deduplicated personalities 1220 (Block 1377). External rules can be used to cluster certain personalities 1220 (Block 1378). The remaining clusters are actors 310 (Block 1379). If, on the other hand, the arc weight is not below the threshold (Block 1376), the system clusters the two endpoint aliases. The alias with the largest weight becomes the cluster representative (Block 1363). The termwise weights of the two clusters are added up to give the weight of the new cluster (Block 1364), and the modified arc weights are then recomputed (Block 1365).
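The clustering loop of Blocks 1374 to 1376 and 1363 to 1365 can be illustrated with a minimal sketch. The function name `cluster_aliases`, the scalar cluster weights, and the rule that merged arc weights are summed are assumptions made for illustration only; the flowchart does not specify how the modified arc weights are recomputed.

```python
def cluster_aliases(alias_weight, arc_weight, threshold, negative_pairs=()):
    """Greedy agglomerative clustering of alias vertices (illustrative only).

    alias_weight:   alias -> scalar weight (assumed; the source uses termwise weights)
    arc_weight:     (alias, alias) -> arc weight
    negative_pairs: alias pairs that must never be clustered together
    """
    negative = {frozenset(p) for p in negative_pairs}
    members = {a: {a} for a in alias_weight}   # representative -> member aliases
    weight = dict(alias_weight)                # representative -> cluster weight
    arcs = {frozenset(pair): w for pair, w in arc_weight.items()}

    def allowed(pair):
        a, b = pair
        return not any(frozenset((x, y)) in negative
                       for x in members[a] for y in members[b])

    while True:
        candidates = [(w, sorted(pair)) for pair, w in arcs.items() if allowed(pair)]
        if not candidates:
            break
        w, (a, b) = max(candidates)            # arc with the largest weight
        if w < threshold:                      # below threshold: clustering is finished
            break
        # The alias with the largest weight becomes the cluster representative.
        keep, drop = (a, b) if weight[a] >= weight[b] else (b, a)
        members[keep] |= members.pop(drop)
        weight[keep] += weight.pop(drop)       # add up the weights of the two clusters
        # Assumption: arcs that now join the same two clusters are summed.
        merged = {}
        for pair, v in arcs.items():
            pair = frozenset(keep if x == drop else x for x in pair)
            if len(pair) == 2:
                merged[pair] = merged.get(pair, 0.0) + v
        arcs = merged
    return members
```

Each key of the returned dictionary is a cluster representative, and each value is the set of aliases folded into that cluster.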

Actor Personalities

A given actor 310 may (consciously or not) keep different kinds of matters well separated in his communications. For example, an actor 310 may have discussions with a first set of persons using English, from his work email account. But he may correspond with his confidants in a different language. He may also exchange communications on different topics, or in different registers, via a private email account using an anonymous moniker. We refer to these distinct and different usages of electronic identities 1225 as ‘personalities’ 1220.

FIG. 14 is a flowchart of one embodiment of actor personality identification. The system determines if an actor has more than one electronic identity (Block 1405). If not, the process continues to the next actor (Block 1410). Otherwise, the process continues, to determine whether the actor has multiple personalities 1220. Note that the presence of multiple electronic identities 1225 for an actor 310 does not by itself constitute evidence of multiple “personalities” 1220. Rather, it is very clear distinctions in the usage of some of these identities that cause the system to make this distinction.

In one embodiment, such potential differences are analyzed by clustering all the content (Block 1415) created by a particular actor without respect to the various electronic identities used to create the content. If the clusters do not closely correspond to these electronic identities (Block 1420), the actor has only one personality (Block 1410). However, if there is at least one cluster that corresponds closely to at least one but not all of the electronic identities, each such cluster will be considered a new personality (Block 1425). Assuming that not all of the content is covered in this way, remaining clusters will be rounded up (Block 1430) into a single personality. For one embodiment, each personality will be labeled sequentially according to the amount of content produced by it (Block 1435). This is measured by total number of items created in some embodiments, while in other embodiments only production of “contentful” documents 505 is counted. “Contentful” documents 505 are those that contain a substantial amount of text apart from any that was present in the creating template. In the event that there is a tie, the primary work-related personality takes precedence over all others. In another embodiment, a change in spoken language used, or a change in register, will by itself be considered evidence of a different personality, when correlated to different electronic identities. In yet another embodiment, null or near-null intersections in topics discussed (as determined by ontological filters), or in the actors or personalities communicated with, can also be used independently or conjunctively as evidence of different personalities. The process performs this analysis for each actor (Block 1440).

Personalities are useful both as an input parameter to the ontology filters that are described below, and as a means of providing negative evidence to the discussion building engine for including content from more than one of an actor's personalities in the same discussion. Otherwise put, if an actor has very carefully segmented her different personalities, the system will do the same. However, queries performed on the actor will return results from all of their identified personalities. This may optionally be broken down by personality. Copending application Ser. No ______, entitled “A Method and Apparatus to Visually Present Discussions for Data Mining Purposes”, filed concurrently herewith, discusses the different presentation methods of such data.

“Circles of Trust”

FIG. 15 is a block diagram of one embodiment of circles of trust or clique class hierarchy and associations. Circles of trust 1505 are sets of actors 310 who consistently correspond in a closed loop with one another. Circles of trust 1505 may be associated with only very specific topics, or just be a set of close confidants who regularly communicate amongst themselves on a variety of topics. Circles of trust 1505 are initially calculated using the thread sub-graph (T-colored edges) so that they may be used as an evidence source, and then are recomputed as a by-product of discussion building across the corpus. However, once calculated, they are considered a derived property of the actor. This is because the answer to the question “who do they talk to” is often a fundamental one in an investigative or litigation context. Shifts in membership of circles of trust 1505 can also provide critical context in this regard.

Circles of trust 1505 will be identified by the system using heuristics described in the following paragraphs. Circles of trust 1505 can be of various types:

-   -   Crisis or event-motivated Circles of Trust 1520
    -   Professional Circles of Trust 1510
    -   Friendship-based Circles of Trust 1515

Crisis or event motivated circles of trust 1520 are circles of trust 1505 that only manifest themselves in response to certain kinds of stimuli, for example rumor-mongering about a merger. FIG. 16 is a flowchart of how crisis or event motivated circles of trust 1520 are identified. They are derived in the following way:

-   -   Perform a frequency per unit time analysis (Block 1605) on all
        communications between pairs of actors with edges connecting
        them in the actor graph.
    -   Identify any bursts (Block 1610) in the activity level of
        communication between these actor pairs, and the time intervals
        (Block 1615) in which these bursts occurred. If no bursts are
        found (Block 1610), then the process ends (Block 1635), and the
        system determines that there is no crisis/event based circle of
        trust.
    -   Any totally connected sub-graph (Blocks 1620, 1630) within
        those bursts (i.e. the set of actors connected by GB-colored or
        PB-colored edges, where each edge links communications within an
        activity burst covering the same interval of time) will be
        considered a circle of trust if it occurs in more than one
        interval of time. For any such subgraph, add CL-colored edges
        between each pair of member vertices. Discrete intervals of time
        are determined by the communication behavior of the actors in
        question; each burst is centered in an interval of time, and
        modeled to a standard distribution. In one embodiment, if the
        curves of two bursts do not intersect within their respective
        second deviations from their centers, they are not considered to
        lie in the same interval of time. If there are N>2
        occurrences of such bursty behavior, the system will attempt to
        correlate the start of the burst with events it is aware of
        (Block 1625) based on the actors 310, the topics 310 which
        occurred within a user-configurable time preceding the start of
        the burst. If the bursts meet the criteria, the crisis or event
        motivated circle of trust is identified (Block 1640).

Note that finding totally connected sub-graphs is in general an NP-complete problem; however, it is tractable with sparse graphs. Further, this pairwise heuristic increases the sparseness of the graphs.
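The derivation above can be sketched as follows, as an illustration under stated assumptions rather than the claimed method. The helper names (`same_interval`, `maximal_cliques`, `crisis_circles`) and the `min_size` parameter are hypothetical. Totally connected sub-graphs are enumerated with the basic Bron-Kerbosch procedure, which is exponential in the worst case but practical on sparse graphs, and only cliques recurring with identical membership are reported.

```python
from collections import Counter, defaultdict


def same_interval(center1, sigma1, center2, sigma2):
    """Two bursts share an interval if their 2-sigma ranges intersect."""
    return abs(center1 - center2) <= 2 * (sigma1 + sigma2)


def maximal_cliques(adj):
    """Enumerate maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def crisis_circles(edges_per_interval, min_size=3):
    """Return actor sets that are totally connected in more than one interval.

    edges_per_interval: one list of (actor, actor) burst edges per time interval.
    min_size: smallest clique reported (an assumption; the source gives no minimum).
    """
    seen = Counter()
    for edges in edges_per_interval:
        adj = defaultdict(set)
        for a, b in edges:
            adj[a].add(b)
            adj[b].add(a)
        seen.update(c for c in maximal_cliques(adj) if len(c) >= min_size)
    return {clique for clique, n in seen.items() if n > 1}
```

In a fuller implementation, each returned set would receive CL-colored edges between every pair of its members.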

Professional circles of trust 1510 operate on an ongoing basis, but often tend to be largely restricted to a limited set of topics, which are generally work related. In this case, similarly, the system searches for totally connected sub-graphs in the colored graph based on topic-limited communication among actors, e.g. TC-colored links. However, as “burstiness” is not a factor here, a clustering analysis is performed in order to cluster actors and communications on known topics. In the absence of a complete ontological covering of the corpus, standard lexical similarity measures may be substituted. Any totally connected sub-graph of TC-colored links is recorded by adding CL-colored links between vertices in the subgraph; these clusters are identified as professional circles of trust 1510. Note that in some embodiments of the system, shortages in the lifespans of different actors relative to the time period in question will be accounted for by artificially boosting their communications count proportionally to the amount of time they were absent during the time interval being considered.

Friendship-based circles of trust 1515 are similar to professional circles of trust 1510, but are not topic-limited; by their nature, the actors communicate on a wide variety of topics, namely whatever happens to be on their minds at the time.

Note that in addition to circles of trust 1505, there is also the notion of chains of trust. These are cases in which trust is not transitive. For example, Actor A IM's Actor B with a secret of some kind. Actor B now forwards this information to Actor C, without the knowledge of Actor A. Specifically, there is pairwise communication amongst actors, and possibly N-ary communication among actors, but only for values of N less than the number of actors in the chain. The system identifies such chains of trust, and considers them primarily as an anomaly worth detecting. This is discussed in more detail below in the section on anomaly detection.

As part of this calculation, in one embodiment, unusual troughs in pairwise actor communication are also noted; bursty behavior and extreme lack of communication fall on opposite ends of the distribution curve (more than 3 sigma out). As the system looks for bursts of communication among related actors, it also looks for related troughs. Troughs spanning major holidays are automatically removed; if an HR system or calendar is available to provide vacation and other planned absence information for the actors in question, troughs spanning these time periods are similarly discounted. Finally, actor heartbeat frequency (a form of measuring or auditing electronic activity defined in a subsequent section) during the time interval in question is examined. If there is no heartbeat during this period, the actor is considered to have been absent. If the heartbeat is reduced, the amount of communication with all like actors (in the sense of being members of the same equivalence class, for example members of the same organization) is expected to be reduced proportionally for all such actors. If this expectation is violated, it is considered to be a trough of communication involving particular actors, rather than just a global trough in activity due to a deadline or other outside force. Such trough information is queryable through the system by providing it a set of actors, and optionally, a timeframe.
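A minimal sketch of the 3-sigma test for one pair of actors follows. The function name `flag_bursts_and_troughs`, the use of per-period message counts, and the choice to leave excluded periods (holidays, planned absences) out of the statistics altogether are assumptions for illustration; the heartbeat comparison is not modeled.

```python
from statistics import mean, pstdev


def flag_bursts_and_troughs(counts, sigmas=3.0, excluded=frozenset()):
    """Return (burst_periods, trough_periods) as lists of period indexes.

    counts:   messages exchanged by one actor pair in each time period
    excluded: indexes of periods to ignore, e.g. holidays or planned absences
    """
    usable = [c for i, c in enumerate(counts) if i not in excluded]
    mu, sd = mean(usable), pstdev(usable)
    bursts, troughs = [], []
    if sd == 0:                      # perfectly flat activity: nothing to flag
        return bursts, troughs
    for i, c in enumerate(counts):
        if i in excluded:
            continue
        z = (c - mu) / sd            # distance from the mean in standard deviations
        if z > sigmas:
            bursts.append(i)
        elif z < -sigmas:
            troughs.append(i)
    return bursts, troughs
```

Because counts cannot be negative, a trough can only be flagged when the mean activity level is more than three standard deviations above zero.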

Actor Importance

In one embodiment, there is the notion of actor importance. Importance is determined by the following factors:

-   -   Position in the organizational hierarchy
    -   Role or job title in the organization
    -   Number of pivotal items associated with them (see discussion
        below)
    -   Number of resolution items associated with them (see discussion
        below)

In one embodiment, the importance score of an actor is raised by 1 for each pivotal or resolution item he generated. Actors at the lowest level of the organizational hierarchy are assigned an importance of 0; at each subsequent level of the hierarchy, an additional 10 points are added. In one embodiment, if an actor appears in more than one level in the hierarchy, the highest level is used for the score. The user specifies how to score various job titles or roles within a particular organization. In one embodiment, the system will try to infer the importance of certain roles on the basis of their being approval/denial points in a workflow, as will be discussed in more detail below. Note that since any of these factors may change for a given actor (some frequently), importance information is periodically re-calculated in continuous or incremental versions of the invention. In some embodiments, actor importance is averaged across the lifespan of the actor. In others it is segmented; that is, the importance of the actor is calculated on a per message or per time interval basis in order to account for the fact that actor importance is dynamic. In general, actors become more important the longer they are around; they participate in more discussions, and are generally likelier to increase in organizational rank. The primary usage of actor importance is in ranking the results returned in response to a query.
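As a worked illustration of the scoring just described, consider the sketch below. The function name `importance` and the additive treatment of the user-specified role score are assumptions; the text does not state how the role score is combined with the other factors.

```python
def importance(hierarchy_levels, pivotal_items, resolution_items, role_score=0):
    """Importance score for one actor (illustrative only).

    hierarchy_levels: levels at which the actor appears, 0 being the lowest
    role_score:       user-specified score for the job title (assumed additive)
    """
    level_points = 10 * max(hierarchy_levels)   # highest level is used; 10 points per level
    return level_points + pivotal_items + resolution_items + role_score


# An actor appearing at levels 0 and 2, with 3 pivotal items and 1 resolution item:
print(importance([0, 2], pivotal_items=3, resolution_items=1))  # prints 24
```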

Documents

All data input to the system is in some sense considered a “document.” However, in general only the post-processed form of these raw documents is used by the present system. For example, archival formats that contain individual documents such as mail messages are exploded prior to consideration by the sociological context engine. Those items that do not end up being incorporated into discussions retain document status. They are also frequently referred to as “singletons.”

The following is a list of commonly used document attributes. Additional attributes may be added for special applications, or for special document types. (For example, as previously noted, all header information is retained in the case of email.) However, the following have general applicability.

Document Attributes

-   -   Title: extracted where available from document meta-data.
        Otherwise in some embodiments null; in others the first sentence
        or contents of first field in a fielded document. In still other
        embodiments, the user may provide a name for the document if it
        would otherwise have none.
    -   UID: A unique ID code generated by the system. In one
        embodiment, the UID contains a hash of actor and timestamp of
        earliest occurrence (of the de-duped instances).
    -   Creation Date: The earliest date found associated with any
        instance of the document found during the initial de-duping
        phase.
    -   Revision: Current revision number. This is taken from the
        version provided by an outside document management system if one
        exists. If additional versions of the document are found that do
        not correspond to what is in the document repository, the system
        will increment the minor version number in such a way as to not
        conflict with any version number formally assigned by another
        system. For example, if a document repository lists a 4.1 and a
        4.5 version of a document, but the system identifies versions of
        the document that occurred in time between these two versions,
        they will be sequentially numbered using equal increments
        between 4.1 and 4.5. In the event that additional such versions
        are discovered later which conflict, a similar scheme is used to
        insert them into the numbering sequence. Finally, if a document
        has never been checked into a document repository, the system
        will number it sequentially. Each content change results in the
        version number being increased by; each meta-data change
        (including user annotation) bumps it by one. Note that
        communication documents such as emails, IMs, SMS messages, or
        any type of “exactly once” document do not have the notion of
        revision. For such objects, both this field and the next have a
        value of null.
    -   Revision History: An arbitrary number of records of the form
        <actor|timestamp|change description|version number|version
        controller>. The “change description” is either the check-in
        message, or the contents of the email associated with a soft
        revision. The version controller is either the ID of the source
        control or document repository assigning the system or an ID
        representing the invention, if it is assigning a “soft” version
        number after the fact.
    -   Distribution History: An arbitrary number of records of the
        form: <dist_actor|recipient_actors|timestamp|version
        number|distribution event type>. Dist_actor is the ID of the
        actor responsible for the distribution event. Recipient_actors
        can be one or more individual or aggregate actors, or null (in
        the event where no clear recipients can be identified, such as
        posting something on the internet). Distribution event types
        include, but are not limited to: published, posted, produced,
        sent, converted (to other format), and submitted.
    -   Strictly Versioned: 1 if no “soft” revisions exist; else 0. A
        “soft” revision is one that is found by the system but which
        does not correspond to a check-in event in some kind of version
        control or document management system.
    -   Presumed Author: Either the valid author listed in the document
        meta-data if present, or if this actor is determined invalid by
        the ownership assignment process, the actor to whom primary
        document authorship is attributed by the system.
    -   Other Authors: List of IDs of actors who have modified the
        document at some point.
    -   OCR: 1 if this text was extracted from an OCR system, else 0.
    -   Reconstructed: 1 if the document was partially or totally
        reconstructed by the system, else 0. (An example of this would
        be an email that had been reconstructed on the basis of the
        inclusion of its text and partial header information in a
        subsequent message.)
    -   Topics: The set of topics identified in the document as a result
        of ontological analysis, statistical topic analysis, or any
        combination of these methods.
    -   Named Entities: The set of named entities appearing in the
        document. This includes, but is not limited to: actor names,
        document titles, and geographic locations.
    -   Natural Language: The spoken language(s) in which the document
        is written.
    -   Pragmatic Tags: The set of pragmatic tags that apply.
    -   Template: The ID of the template used to construct the document;
        null if there is none.
    -   Is Template: 1 if the document is a reusable template, else 0.
    -   Document Type: Possible values include, but are not limited to:
        email, database, regular document, spreadsheet, task list, and
        project schedule.
    -   Created by Application: The application which was used to create
        the document, for example Microsoft Outlook or Word, Netscape
        Messenger, Excel, etc. This is generally determined by file
        extension; however, some embodiments of the invention do scan the
        initial bytes of a file in order to validate the type suggested
        by the document extension.
    -   Related Event: ID of the event that the document is linked to,
        if there is one. For example, a document might be the meeting
        transcript of a conference call.
    -   Content Fingerprint: This contains the set of records
        traditionally used in indexes. This includes, but is not limited
        to, depending on embodiment: an inverted frequency index of
        stems, proximity tuples of such stems, and term density
        information.

Document Similarity

Once there is a reasonable cut of uniqued actor information, the system may return to processing documents. The next step in this processing is creating a lineage for each document; that is, performing the digital archaeology process of determining which documents are either different versions of the same document, or which had forebear documents which contributed significant content to them. In each of these cases, the system will attempt to reconstruct the derivation tree as faithfully as possible. Two main cases of interest are considered:

-   -   Reusable document templates
    -   Free-format documents

Reusable Template Documents

Many documents created in corporations are based on templates of some kind. These templates can vary from being fairly simple forms such as expense reports and vacation requests to very long and complex contract boilerplates. However, what they all have in common is that apart from date and entity information, the majority of their content is the same. Further, and more importantly, the relative location of content that changes is almost always the same. For example, while the number of characters in the changed content may vary, the markers on either side of it will be the same. Templated documents may in some instances also be identified from the document meta-data, which specifies the name of the template used to create the document 505.

Reusable template document information is used primarily by the system in order to determine when workflow processes are occurring. Such documents are presumed to have no special relationship to one another by dint of using the same template. For example, in a large company, in the course of a year, many people will file vacation requests, but the fact that they all used the same template to do so does not make the documents related to one another. However, a document based on a reusable template (a templated document) may have different versions based on changes being made to the non-templated portions. In this event, such templated documents may be considered related; this will be discussed further below. Apart from this exception, templated documents are not considered to be so-called “contentful” documents, that is, documents in which significant new content has been authored.

Free-Form Documents

Other documents are created totally from scratch, or largely from scratch but with some borrowing of content from other documents. Some of these may contain substantial content from other documents, while others, though written totally from scratch, may appear superficially similar to other existing documents. As will be seen, the system distinguishes between these two cases, since in the former case there is an ancestral relationship between the documents involved, while in the latter case there may be no relationship whatsoever between the documents. For example, many press releases announcing new products sound very similar, boasting that their offering is “cheaper faster better” regardless of what the product may be or who its manufacturer is.

All predominantly text “office” documents are considered by the system to be either free-form or template based. (Essentially, this is all regular documents except for fielded documents.) Communications such as emails and IMs are excluded from this categorization, as are all structured documents, such as databases, spreadsheets, or other record-containing files. The former are not considered to have versions; the latter do, but are not amenable to the following technique, and so are treated differently.

Documents that were identified during the initial indexing process as being of the non-fielded “regular document” category are now re-indexed at the level of individual sentences longer than, in one embodiment, 5 words (Block 1780), and paragraphs (Block 1785). During this process, repeated elements are de-duped, but a reference to the UID of each document where the segment occurred, and its relative position in that document, is added to the index. Optionally, the user may configure the system to index at even lower levels of granularity, for example at the phrase level, or creating contiguous segments of N words each, the value of N being specified by the user. In addition, sampling may be used to limit the number of individual units being indexed if desired.
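A minimal sketch of such a sentence-level index follows. The function name `build_sentence_index`, the naive regular-expression sentence splitter, the lower-casing of keys, and the use of the sentence ordinal as the relative position are all assumptions for illustration.

```python
import re
from collections import defaultdict


def build_sentence_index(docs, min_words=5):
    """Map each distinct sentence to the (UID, position) pairs where it occurs.

    docs: document UID -> plain text
    Only sentences longer than min_words words are indexed.
    """
    index = defaultdict(list)
    for uid, text in docs.items():
        # Naive splitter: break after ., ! or ? followed by whitespace.
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        for position, sentence in enumerate(sentences):
            if len(sentence.split()) > min_words:
                # Repeated sentences share one key; only the reference is added.
                index[sentence.lower()].append((uid, position))
    return index
```

Each key appears once regardless of how many documents contain the sentence, which is the de-duplication the paragraph above describes; the list of references records where it occurred.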

Given this index of segments, the system creates a clustering of the documents based on a count of segments common to two or more documents. FIG. 17 a is a flowchart of one embodiment of determining document similarity. The procedure is as follows:

-   -   Sort the above-described records in descending order of number
        of non-null fields (Block 1705); this will cause the most
        frequently-occurring sentences to appear at the top of the
        index.
    -   Starting with the first record in the sentence index and
        proceeding sequentially (Block 1710), the system creates an
        S-colored vertex (Block 1715) for each document.
    -   Similarly, it creates SC-colored links connecting each pair of
        S-colored vertices that appear in the same record. If an edge
        already exists between 2 vertices, increment the weight of the
        edge by 1 (Block 1720). If there are no more records in the
        sentence index (Block 1725), the process continues to block
        1730; otherwise, the process returns to block 1710, to the next
        record in the sentence index.
    -   Repeat this process, but this time using the index of
        paragraph-sized segments (Block 1730). Augment the weight of
        each edge (Block 1740) between documents by 10, for every case
        where a paragraph containing more than 3 sentences occurs. Note
        that content in headers, footers, or signatures is excluded
        from this rule, when it can be determined by the system. This
        scheme will have the effect that content-heavy templates and
        different versions of the same document will have the highest
        scored links; less text-heavy templates and documents that
        “borrow” substantially from other documents will also have
        noticeable link weights. Anything else should have only
        negligibly scoring links attached to it. If there are no more
        records in the paragraph index (Block 1745), the process
        continues to block 1750; otherwise, the process returns to block
        1730, to the next record in the paragraph index.
    -   Next, any edges are eliminated which have weights below a
        certain significance threshold (Block 1750). The intention of
        this is to remove linkages that are in effect random. In one
        embodiment of the invention, any link with a weight less than 10
        is removed unless the length of the document is less than 100
        sentences.
    -   Now the system applies a hierarchical clustering procedure
        (Block 1755) to the documents that are still linked into the
        graph.
    -   The result is a hierarchical clustering of documents. The lowest
        level of the clustering defines near-duplicate documents.
        Replace SC-colored links between vertices in these clusters with
        CR-colored links. All documents in such a cluster will be
        defined as “near-duplicates.” The higher levels of the
        clustering define groups of documents that are related by the
        presence of common spans of text and retain their SC-colored
        links.
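The edge-weighting and pruning steps above (Blocks 1710 to 1750) can be sketched as follows. The function name `link_documents` and its parameters are hypothetical. The sketch assumes each index maps a segment to the UIDs of the documents containing it, that the paragraph index already holds only paragraphs of more than 3 sentences, and that the short-document exemption applies when either endpoint is short; the hierarchical clustering step is not shown.

```python
from collections import Counter
from itertools import combinations


def link_documents(sentence_index, paragraph_index, doc_len,
                   min_weight=10, short_doc=100):
    """Return {(uid_a, uid_b): weight} for document pairs sharing segments.

    sentence_index / paragraph_index: segment -> list of document UIDs
    doc_len: document UID -> length in sentences
    """
    weight = Counter()
    # A shared sentence adds 1 to the edge; a shared paragraph adds 10.
    for step, index in ((1, sentence_index), (10, paragraph_index)):
        for uids in index.values():
            for a, b in combinations(sorted(set(uids)), 2):
                weight[(a, b)] += step
    # Drop weak links unless one of the documents is short.
    return {
        edge: w for edge, w in weight.items()
        if w >= min_weight or min(doc_len[edge[0]], doc_len[edge[1]]) < short_doc
    }
```

The surviving edges would then be handed to whatever hierarchical clustering procedure the embodiment uses.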

As the overall running time of this process can be a limitation in many cases, especially where the graph involved is not sparse, in another embodiment of the invention, the so-called greedy set cover heuristic is used instead. It allows distinct but close document sets to be collapsed into one during the counting phase. The greedy set cover is robust but is nevertheless an approximation of the minimal set cover. This also avoids incrementing weights O(s*d*d) times. Standard and custom local search techniques can be applied to improve the quality of this heuristic. Another embodiment also performs n-gram analyses and/or creates suffix trees, and uses these as another dimension to provide to the clustering analysis.
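The text does not spell out how documents are mapped onto a set cover instance, so the following only illustrates the generic greedy heuristic. It assumes, purely for illustration, that each document is represented by the set of segment IDs it contains; the name `greedy_set_cover` is hypothetical.

```python
def greedy_set_cover(universe, subsets):
    """Pick subsets greedily until the universe is covered (approximate cover).

    universe: iterable of elements to cover, e.g. segment IDs
    subsets:  name -> set of elements, e.g. document UID -> segment IDs
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Take the subset that covers the most still-uncovered elements.
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            break                    # remaining elements cannot be covered
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen
```

The result is not guaranteed to be a minimal cover, which matches the caveat above that the heuristic is an approximation.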

The above analysis is based purely on lexical content. The system now matches actors up with these documents. FIG. 18 is a flowchart of one embodiment of actor involvement determination. The system examines each of the following data sources:

-   -   Emails in which any of these documents were emailed from one
        actor to another.
    -   Which machines or media the documents appear on (prior to
        de-duping), and the actors who “own” the machines or media.
    -   Check-in and check-out logs from document management systems
    -   Publication or posting dates to an intranet (or to the internet)

For each document the following information is retrieved and placed in a linear sequence according to time:

-   -   Email: From|To|cc|bcc|Date and time stamp|inline message content
    -   Copies: {list of actor IDs corresponding to machines or media
        considered to be owned by them}
    -   Logs: Check-in or Check-out|Actor ID|Date &
        Timestamp|Notification list (actor IDs)|check-in message text
    -   Publication/Posting date: Actor ID|Date & Timestamp

This data is used as evidence of actor “touch” and is extracted for the current document (Block 1805). In addition, actors may have important connections to the document that are not associated with a particular timestamp. These include, but are not limited to, authorship (as reflected in the document meta-data, or as corrected by the system) and mention in the document contents. Such associations are added to the count of actor involvement occurrences, though they will in most cases strongly correlate with the timestamped information.

As noted earlier, document meta-data will not always provide a correct author, or in some cases, any author at all. If more than one actor ID appears in these records (Block 1810), the system resolves this in these cases by assigning the author/owner on the basis of the machine on which the earliest known version of the document appears. Some embodiments may instead count the number of actor “touches” of the document as described above (i.e., emailing it around, checking it in to a document management system, etc.) and assign ownership of the document to the actor with the most touches of the document. Other embodiments may combine these tests, and assign more than one author or owner as appropriate, or assign different owners over different time spans, if the actor activity of the persons in question does not overlap in time.
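The two ownership tests can be sketched as below. The function names and the flat `(actor_id, timestamp)` record shape are assumptions for illustration; ties in the touch count are broken arbitrarily here, which the text does not address.

```python
from collections import Counter


def owner_by_earliest_copy(copies):
    """copies: iterable of (actor_id, timestamp) for machines holding the document."""
    return min(copies, key=lambda record: record[1])[0]


def owner_by_touches(touches):
    """touches: iterable of (actor_id, timestamp) for emails, check-ins, postings."""
    counts = Counter(actor for actor, _ in touches)
    return counts.most_common(1)[0][0]


copies = [("bob", 5), ("ann", 2)]
touches = [("ann", 1), ("bob", 2), ("bob", 3)]
print(owner_by_earliest_copy(copies))  # prints ann
print(owner_by_touches(touches))       # prints bob
```

The example shows that the two tests can disagree, which is why some embodiments combine them or assign more than one owner.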

Note that inline text in email accompanying document attachments is by default treated by the system in the same way as a check-in message. The system now pulls the document title information, and uses this as a third dimension to provide to the clustering analysis; the clustering analysis is now performed again.

From this data, a set of actor IDs corresponding to actors who had some clear involvement with the particular document may be obtained. Working upwards from the lowest level of the clustering, intersect the set of actors involved with each document within the cluster that has more than one actor associated with it (Block 1845). If the set is non-null, the cluster is presumed to correspond to different versions or variations of the same document (Block 1855). If the set is null, the cluster is presumed to correspond to a reusable template that is in fact being reused by a number of unrelated people (Block 1850); in these cases, replace the original edge (SC or CR color) with a TM-colored edge. If, however, there are hierarchical clusters of templated documents, this is interpreted by the system as evidence that multiple, distinct, significant sets of new content were added to the template (Block 1855). Note that for versioned documents, any document information provided by a document repository is considered to be hard evidence that overrides any other evidence type. For example, if a document management system records that document A and document B are respectively versions 3.1 and 3.8 of the same document, this will be asserted as a fact by the system, even if the clustering analysis contradicts it. The actor IDs are then uniqued, and identified as the actors involved with this document (Block 1860).
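The intersection test of Block 1845 can be sketched as follows. The function name `classify_cluster`, the string labels, and the `None` result for a cluster in which no document has more than one actor are assumptions; the text does not cover that last case, and hard evidence from a document repository would override the result.

```python
def classify_cluster(actors_by_doc):
    """Classify one cluster by intersecting the actor sets of its documents.

    actors_by_doc: document UID -> set of actor IDs involved with that document
    Returns (label, common_actors); label is "versions", "template" or None.
    """
    # Only documents with more than one associated actor take part.
    sets = [set(actors) for actors in actors_by_doc.values() if len(set(actors)) > 1]
    if not sets:
        return None, set()
    common = set.intersection(*sets)
    return ("versions" if common else "template"), common
```

A "template" result is the case in which the original SC-colored or CR-colored edge would be replaced with a TM-colored edge.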

In some embodiments, in the above process the number of interactions an actor had with the particular document will be considered; in such embodiments, the above-described intersection process only considers actors that had multiple interactions with the document.

Next, the actor sub-graph is updated to reflect this initial collaboration information. A CO-colored edge is added (or its edge weight incrementally adjusted) between each pair of actors appearing in conjunction with each other on the same document, or on any document in the same cluster where the actor intersection set is non-null. This information will be further refined after discussion building. However, adding an initial cut at this information prior to discussion building makes available useful evidence for the process, namely which actors collaborate together in order to create content.

FIG. 19 illustrates the iterative nature of building an actor graph. As can be seen, the process is iterated multiple times to obtain an accurate actor graph.

Textblock Identification

Textblocks consist of the maximum contiguous sequence of sentences or sentence fragments which can be attributed to a single author. In certain cases, especially emails, a different author may interpose responses in the midst of a textblock. However, the textblock retains its core identity for as long as it remains recognizable. This is used for a variety of important applications, such as redaction.

FIG. 20 a is a flowchart of one embodiment of textblock identification.

-   Tokenize each sentence (Block 2005) with a search engine tokenizer. This defines individual words. (Note that tokenizers may have special rules to consider such items as AT&T, I.B.M., joe@doe.com etc. as single words.)
-   Consider the next word (Block 2010). Scan the inter-word characters for evidence of a sentence boundary (Block 2015). If a punctuation character other than a comma occurs between two words, the system places an end of sentence marker between the words (Block 2025).
-   If the running count of words is over N (Block 2020), also place an end of sentence marker (Block 2025). This is important for certain writing styles (informal email, poetry) and for content potentially stripped of its punctuation by previous tools. In one embodiment of the invention, N is set at 50; another embodiment may use a different value, or allow the user to choose it.
-   When a termination marker including, but not limited to, new line, paragraph, or change in quote marker in email (see below) is found (Block 2030), truncate the textblock (Block 2035). Any document meta-data, such as that provided by a change tracking mechanism, is similarly handled.
-   Add all sentences to a special inverted index which also provides reference to their textblock of occurrence. For space efficiency, the sentences may be hashed before storage. Determine if there are further records (Block 2040), and if there are, return to block 2010.
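The sentence-boundary steps above (Blocks 2010 to 2025) can be sketched as follows. This is a simplified illustration under stated assumptions: the regular expression `WORD` stands in for a real search engine tokenizer, and the names `WORD`, `BOUNDARY`, and `split_sentences` are hypothetical. Termination markers and the inverted index are not modelled.

```python
import re
from typing import List

# Assumption: a stand-in tokenizer that keeps items such as AT&T or
# joe@doe.com together as single words.
WORD = re.compile(r"\w+(?:[.@&]\w+)*")
# Any punctuation character other than a comma marks a sentence boundary.
BOUNDARY = re.compile(r"[^\w\s,]")


def split_sentences(text: str, max_words: int = 50) -> List[List[str]]:
    """Split text into sentences, each represented as a list of words."""
    sentences: List[List[str]] = []
    current: List[str] = []
    prev_end = 0
    for match in WORD.finditer(text):
        gap = text[prev_end:match.start()]  # the inter-word characters
        # End the sentence on boundary punctuation, or when adding another
        # word would push the running count over max_words (N in the text).
        if current and (BOUNDARY.search(gap) or len(current) >= max_words):
            sentences.append(current)
            current = []
        current.append(match.group())
        prev_end = match.end()
    if current:
        sentences.append(current)
    return sentences
```

The default of 50 for `max_words` follows the value of N given for one embodiment; the length cap matters mostly for text with little or no punctuation.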

Structured, or other fielded, documents are handled separately, as one of the key dimensions of analysis for derivation, continuous lexical analysis, is not applicable. Instead, different mechanisms are defined for each main type of such document. For example, log files can be easily versioned based on the timestamp of the last event they contain. However, spreadsheets may have similar column and row names, as many do, but very different numbers in the body. The difference could be one of a single formula change, or it could be one of hundreds of individual changes. Thus, one embodiment of the invention has a special mechanism whereby the formulas from different similar but unversioned spreadsheets are applied to one another in an attempt to determine ancestry. Generally, in the case of such documents, any change tracking information is considered to be valid, as is that from any document repository.

Email is specially treated, due to the fact that a message will often contain the contents of previous messages, and so project false similarities. Thus, the system recognizes quoted text, using textblock attribution. FIG. 20 b is a flowchart of one embodiment of textblock attribution.

The system assumes that emails contain text that was authored by one or more actors. Because of the frequency of the “more than one actor” case, the system attempts to attribute an author to each of these segments. It will thus perform the following operation on emails, starting with those that exist within an email thread. In one embodiment, all emails that are part of a thread are examined sequentially within the context of a thread, starting with the first element in the thread. The process is as follows:

The textblocks at depth zero are attributed to the sender of the communication (Block 2060). Note that in this step, and indeed also in the following steps, once a textblock has been assigned to a particular actor ID, this same textblock will be attributed to this actor ID for the rest of the given email thread. This is critically important, since it is information that can be used to reconstruct individual emails in the thread that may no longer be independently present in the corpus.

-   Textblocks at a greater depth are attributed based on decorations found within or immediately before the textblock. Textblocks at depth 1 are presumed to be authored by one of the recipients of the message. (If there is only one actor on the to: line, this actor is the presumed author of text at depth 1.)
    -   One type of decoration is a “quote introduction” line (Block 2065). The system looks for patterns which include, but are not limited to: “^(\s)*On xxxxx ([:w:] [:w:]) said:“\n(\n)?” For example, this may be a match in the textblock “On Apr. 1, 1991 John Smith said:” The system will also extract customized quote lines that an actor may have created in those email applications which support this feature. The name portion of the match is used as actor evidence for an actor model lookup. If it succeeds, it is used as evidence that the immediately subsequent textblock should be attributed to that actor.
-   Another type of decoration is an inline signature block (Block 2070). Note that this can only be applied if the signature was left in the quoted text. In the event that it is, it is used as evidence both to attribute the textblock that includes it to the actor whose name appears in the signature, and further to attribute all textblocks at that level (for example, with equivalent indent or quote marker) to this same actor.
-   The system also extracts sender and recipients in the upward closure of the containing email thread (Block 2075). This information is used to fill in any missing author identities.
-   Each textblock is assigned a unique ID which may be used anywhere the same textblock appears (Block 2080). This ID includes a hash of the ID of the actor who authored it. (Note that in addition to automated inclusion in replies and forwards to a message, a textblock could get included in an email by the user copy/pasting it. Hence, the system does not require a forward or reply in order to identify a textblock in a communication. However it does require that it be the same actor responsible for it.)

Note that textblocks are correctly identified even after other actors have decorated them with additional characters, because such text is identified by the above process as being a foreign or subsequent insertion. The more difficult case is the one in which a subsequent actor has snipped some text out of the original textblock, rather than just adding additional characters. In one embodiment, if a single sentence greater than the larger of 5 non-stop words or 50 characters survives completely, the textblock is still identified. Other embodiments may take different approaches, including, but not limited to: only allowing a fixed number of trailing characters in the textblock to have been deleted, or requiring that the same ontology classes must be triggered in the descendant textblock as in the original. Some of these embodiments may take the presence of a foreign insertion as evidence of the textblock being present; others may not. Once the textblock has been so identified, its ID is attached to each document in which any form of it appears. In general, if a textblock appears in two documents that have been attributed to different actors, the earlier appearance determines the attribution (Block 2085).

The present invention supports all methods of representing quotes within a message. These include (but are not limited to):

1. The insertion of a character to the left of the text being quoted. An example follows. In the example, the quoted text appears above the new text, but the present invention also accepts quotations that appear below the new text, or messages that contain a mixture of quotations above and below the new text, as follows:

>>This text is the quotation of a previous quotation, as
>>can be seen from the repeated quote identifier.
>>
>This text is quoted from the previous e-mail in the reply
>chain, as can be seen from the single quote identifier.
>
This is the text of the current message

2. The use of a block quote mechanism in which the quoted text appears below the new text in a block that may or may not be indented. The present invention places no restrictions on whether or not the block is indented, and is flexible with regard to the headers that appear at the top of the quotation, or any heading that may appear above the quotation (“-----Original Message-----” in this example, which is taken from the default quotation settings of Microsoft Outlook 2000™).

Here is the new message text, which is not automatically indented. The quoted text appears below.

-----Original Message-----
From: Fred Smith [mailto:fred@blah.com]
Sent: Tuesday, Nov. 19, 2002 15:36
To: ann@blah.com
Subject: RE: Meeting

Today's meeting has been postponed until 3 pm, but will still take place in the conference room.

Fred

As noted above, other types of common quotation are supported by the system, including: “^>>”, “<blockquote>”, “-----Original Message:-----”. These may be extended by the use of regular expressions. The system can be extended to recognize any other such delimiters. The user may also add additional quotation indicators.
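For the first quoting style, the quote depth of each line can be derived by counting leading quote markers. The following is a minimal sketch that handles only the ">" marker; the function names `quote_depth` and `blocks_by_depth` are hypothetical, and attribution of each block to an actor is not modelled.

```python
import re
from typing import List, Tuple

# Leading whitespace, then any run of ">" markers, each optionally followed
# by whitespace.
QUOTE_PREFIX = re.compile(r"^\s*((?:>\s*)*)")


def quote_depth(line: str) -> Tuple[int, str]:
    """Return (depth, text) where depth is the number of leading '>' markers."""
    match = QUOTE_PREFIX.match(line)
    return match.group(1).count(">"), line[match.end():]


def blocks_by_depth(lines: List[str]) -> List[Tuple[int, List[str]]]:
    """Group consecutive lines that share a quote depth into candidate blocks."""
    blocks: List[Tuple[int, List[str]]] = []
    for line in lines:
        depth, text = quote_depth(line)
        if blocks and blocks[-1][0] == depth:
            blocks[-1][1].append(text)
        else:
            # A change in quote marker acts as a termination marker.
            blocks.append((depth, [text]))
    return blocks
```

Depth 0 corresponds to text from the sender of the current message, depth 1 to text quoted from the previous message, and so on.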

Lexical or topical similarities are only applied to emails during the discussion building process, at which point all messages in the same thread will be available, assuming that they still exist in the corpus. This is to prevent emails that were part of the same thread, and hence contain the contents of the previous message, from being incorrectly overscored on these measures. IMs, due to their typically short length, will also be analyzed for similarity at a later stage.

Textblock Identification in Regular Documents

In regular documents, there are fewer cues as to content created by an author other than the primary one. However, by using the actor involvement information computed earlier, we may capture the changes made by each actor who made a change (presuming that their version of the regular document is still present in the corpus). This is done by using any commonly available package or method to “diff” the document with other versions close to it in time or version number. Any new text appearing in the diff is attributed to the actor who performed any of the above listed actions, for example sending the modified regular document to other actors, or checking it into a repository.

In the event that the same textblock appears in two or more documents 505 that have been attributed to different actors 310, the earliest appearance of the textblock determines the actor 310 it will be assigned to (Block 2085). Note that this requires performing a query against the set of duplicate items in order to ensure that the true earliest instance is considered.

Note that whether a regular document has been OCR'ed is important because OCR'ed content is considered less reliable than native electronic content. In some instances, the same regular document may appear in a corpus in both native and OCR'ed format. Because of meta-data and document type differences, such documents will not be considered duplicates by the system during the initial indexing processes (and similarly would not have been had the data been pre-indexed prior to submission to the system).

FIG. 20 c is a flowchart of one embodiment of textblock identification within an OCRed document. Documents input through the OCR system (Block 2086) are flagged. Any regular document flagged as “OCR” will cause the system to perform a modified sequential textblock analysis in which all textblocks must match, but a 10% variance in individual characters is allowed, with the proviso that the differences must be randomly distributed throughout the textblock. Some embodiments may allow a variance of up to 30% (Block 2088). Further, in one embodiment, the number of alphanumeric characters in each textblock must match exactly. Alternative settings for declaring a match may be set by the user. In another embodiment, the user may declare a document matching even if these criteria are not met.
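The tolerance test for a single textblock pair can be sketched as follows. This is an illustration under assumptions: character variance is estimated with Python's `difflib.SequenceMatcher`, which is a swapped-in technique and not one named in the text; the function name `ocr_textblock_match` is hypothetical; and the requirement that differences be randomly distributed is not checked.

```python
from difflib import SequenceMatcher


def ocr_textblock_match(native: str, ocr: str, max_variance: float = 0.10) -> bool:
    """Decide whether an OCR'ed textblock matches a native one within tolerance."""
    # In one embodiment the alphanumeric character counts must match exactly.
    if sum(c.isalnum() for c in native) != sum(c.isalnum() for c in ocr):
        return False
    longest = max(len(native), len(ocr))
    if longest == 0:
        return True
    matcher = SequenceMatcher(None, native, ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # Variance: the fraction of characters not covered by a matching block.
    return 1 - matched / longest <= max_variance
```

Passing `max_variance=0.30` would correspond to the more permissive embodiments mentioned above.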

If the OCRed document does not match an existing document within the constraints (Block 2090), the document is added to the repository (Block 2092). If the OCR'ed copy of the regular document is a duplicate of the native electronic regular document, the process can, depending on the user's preferences, either remove the OCRed document from the data set or verify individual textblocks from the OCR'ed regular document against native electronic versions, should they exist. In the latter event, the OCR'ed text is corrected (Block 2094) using the data from the native electronic version. In order to prevent short similar textblocks from being confounded, a minimum number of words is imposed for textblock length. This minimum can, however, be disregarded if two contiguous textblocks match. In one embodiment, this threshold is 50 words. In others, its value is determined by the user.

Event Types

While “contentful” documents or communications are the richest source of data, events 1525 may also be of great utility in either contextualizing or interpreting the content-bearing data. For example, if there is a meeting among the same set of actors that have recently been engaged in an extensive exchange of email on a particular topic, it is quite possible that this meeting also pertains to the same topic. Further, as previously noted, if a content-bearing document contains a reference to a meeting and a date and time, as well as possibly a list of attendees, this is considered further evidence that the meeting was indeed related to the prior discussion. Other types of events 1525 may be entirely external to the corpus itself, but may be interesting from a context perspective, as outside events 1525 do influence internal ones. FIG. 21 a is a block diagram of one embodiment of a partial event hierarchy.

Events 1525 are inherently pieces of data that are content poor, with relation to content indigenous to the actors in the corpus. Most event 1525 types have some text associated with them, but not text that fully describes the event 1525. For example, a phone record might contain the number dialed from and to, a duration, and a date and time. This is certainly information in textual form; however, it does not provide a description of the contents or purpose of the call.

Internal event 2102 types involve actors in the corpus performing some action, either with other actors in the corpus, upon content, or with the outside world. As indicated below, some of these event 1525 types are merely used to decorate discussions. However, others may be integral parts of discussions. For example, a meeting may be where the resolution to a discussion occurred, even if the system lacks the information to say what that resolution was. However, this information can still be used to question deponents in an investigation or litigation context.

While internal events 2102 by definition do not contain their own full descriptive content, instances of some event 1525 types may have relationships with specific content-bearing documents. For example, a meeting 2112 on an online calendar 2108 may have meeting minutes associated with it. Or a conference call 2114 noted on a calendar might have a complete transcription of it in a document. In those instances where the system can detect this relationship, in one embodiment, a combined document is created containing both the event 1525 information and the document content.

The internal event 1525 types in one embodiment of the invention include those listed below. Internal events 2102 can generate two kinds of evidence in the discussion-building process:

Sustenance: There is evidence that the actors involved in the possible discussion are either communicating opaquely with each other (for example, via a phone call for which there is no transcript), or are operating on objects being discussed contemporaneously (for example, editing a report). Sustenance evidence interposes more items in the set of items being considered for inclusion in a discussion. As a result, the interval of time between any time-contiguous communications is shortened, increasing the probability that these items are related.

Hint: A “hint” is any kind of evidence that is of the general form that the presence of X requires Y to have occurred before it, after it, or at all. For example, the presence of a purchase order approval requires there to have first been a request made. Workflows 2110 are an important source of hints. While such information is used in discussion 305 building, such “hints” are a critical source of reconstruction information.

Note that the user may customize different event types and, for navigational and querying purposes, even create hierarchies of them. For example, a “board meeting” is a particular type of meeting request; a “full board meeting” is a particular type of meeting, and so on.

Internal Event Types

-   Calendar events 2108 (e.g. tax day, end of quarter)
-   Electronic meeting requests. Such meetings may be where the resolution of a discussion occurs. It is evidence that there is continued communication among the actors in the discussion. The presence of such a meeting inserts a new item with the date and time stamp of the meeting time.
-   Employee lifecycle events 545 (promotion, transfer, termination, etc.)
-   Document activity 540 (modification, repository checkin/checkout, creation, deletion, etc.) Such activity is evidence that work is being done on particular documents. At such point as this work manifests itself in a newer version of the document being sent out to other actors via email, it is considered to be part of an informal workflow process and is considered an event in the discussion, one that inserts a new date and timestamp. When, as a result of a check-in to a repository, the change is being made available to other actors, the system treats it in the same fashion.
-   Telephone records and telephone messages/voicemail 2114. Under certain circumstances, these will be considered as weak evidence (e.g. sustenance) in building discussions. See below.
-   Wire transfers and financial transactions 2106
-   Workflow events 2110. Workflow systems are considered providers of strong evidence in building discussions. Ad hoc workflows 2120 are considered a source of weaker evidence in one embodiment. In another embodiment, only formal workflow events 2122 are considered.
-   Accounting entries/activity (e.g. bill sent, check cut)
-   Task lists (e.g. an item being added or checked off as completed)
-   Records of packages shipped with carriers who use electronic tracking systems

External Event Types

In one embodiment, external events 2104 are used for decoration only, to help the user recreate the context in which the content was created. That is, they do not generally contribute evidence to the building of discussions 305, but rather are added by the system after the fact.

The main types of external events 2104 are as follows:

-   External events (for example, a sharp stock price drop, earthquake)
-   The publication of external articles

External events 2104 can be attached to discussions either manually, through a third-party application that works programmatically via an API or object model, or automatically. If added automatically, an event will be added to all discussions whose lifespans intersect with the event in time, all discussions that were active at the time of the event, or only discussions containing certain topics as determined via the use of ontology classes or statistical topic analysis methods.
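The first automatic policy, attaching an event to every discussion whose lifespan intersects the event in time, can be sketched as a simple interval-overlap test. The function name and the dictionary of discussion lifespans are assumptions made for illustration; the activity-based and topic-based policies are not modelled.

```python
from datetime import datetime
from typing import Dict, List, Tuple


def discussions_for_event(
    event_start: datetime,
    event_end: datetime,
    lifespans: Dict[str, Tuple[datetime, datetime]],
) -> List[str]:
    """Return IDs of discussions whose lifespan intersects the event in time."""
    return [
        discussion_id
        for discussion_id, (start, end) in lifespans.items()
        # Two closed intervals intersect when each starts before the other ends.
        if start <= event_end and event_start <= end
    ]
```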

Many events will never end up in a discussion. However, prior to the start of the discussion building process, events that are potential candidates for inclusion in discussions are extracted and placed in a data store. In order to be a potential candidate for inclusion, an event must either have an actor associated with it that corresponds to a connected node in the actor graph, or be associated with a document that has been passed through the previously described document lineage procedure. External events, and internal “calendar” events 2108 which are unrelated to specific actors, are an exception to this rule. In one embodiment, rules for inclusion of these event types are specified by the user.

Event Attributes:

-   Name: Some events may have titles associated with them; for example, a meeting request often will. Others, such as phone calls extrapolated from phone records, will not. In the former case, the existing title is used. In the latter, the system will create a title. For one embodiment, the title is created by concatenating a form of the event type name and the actor information. For example, in the case of a phone call, the system generated name for the event would be: <actor name> calling <actor name>.
-   UID: A unique ID generated by the system for the event. In one embodiment, the unique ID includes a hash of actor IDs, date, and start time.
-   Actor IDs: The IDs of any actors associated with the event. This is determined by explicit mention of an actor in the event title or description, inferred on the basis of ownership (for example, if an event is extracted from an actor's personal calendar, it is automatically attributed to them), appearance of the actor's phone number in a phone record (if such number can be uniquely attributed to them), and so on. Note that aggregate actor IDs are valid; for example, a calendar might be owned by a group rather than an individual actor.
-   Start time: The start time, or presumed start time (based on the available evidence), and date of the event.
-   Duration: The duration, or presumed duration (that is, the scheduled duration if no actual duration information is available), of the event.
-   Event type: The type of the event. Note that the user can create their own custom event types, and can also create hierarchies of event types.
-   Related Document IDs: In the case that an event does have one or more documents associated with it that largely or completely define its contents, such document IDs. Such relations are detected by the system by pragmatic analysis, for example searching for content such as “transcript of meeting on <date> between <actors>.” The system will also allow the user to identify such relationships, either on an instance or template level. Note that meta-documents, documents that were not part of the corpus but are subsequently added to the system by the user, may be specified as being related to an event. An example in which this might occur is one in which a deponent provides an account of a telephone call.
-   Description: Some event types may optionally have descriptions associated with them. These event types include, but are not limited to: check-in messages and meeting requests. If such a description exists, it is stored in this field. If not, its value is null.
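The base attribute set above maps naturally onto a record type. The sketch below is illustrative only: the class and field names are hypothetical, SHA-1 is an arbitrary choice of hash, and sorting the actor IDs before hashing is an added assumption so that the UID does not depend on actor order.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Event:
    event_type: str
    actor_ids: List[str]
    start: datetime                         # start time and date
    duration: Optional[timedelta] = None    # actual or scheduled duration
    name: Optional[str] = None              # existing title, if any
    related_document_ids: List[str] = field(default_factory=list)
    description: Optional[str] = None       # null when no description exists

    @property
    def uid(self) -> str:
        """Hash of actor IDs, date, and start time."""
        key = "|".join(sorted(self.actor_ids)) + "|" + self.start.isoformat()
        return hashlib.sha1(key.encode("utf-8")).hexdigest()


def phone_call_name(caller: str, callee: str) -> str:
    """System-generated title for a phone call that has no title of its own."""
    return f"{caller} calling {callee}"
```

Event types with extra attributes, such as a wire transfer with an amount and account numbers, could extend this record with additional fields.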

These are the base set of attributes. However, some event types may have additional attributes. For example, in the case of a wire transfer event, such additional attributes would include the amount of the wire transfer and the relevant account numbers.

Next, the system performs an initial iteration of the analysis of the interaction patterns among actors. In previous processes, the system has already established the amount of both communication and collaboration between the various actors. However, until this point, the temporal dimension to the interactions has been ignored. At this point in the process, this information is considered. Specifically, the system is designed to obtain the mean response times to communications between pairs of actors. This process includes the following concepts:

Actor Heartbeat: Every externally auditable electronic action performed by an actor, for example sending an email, logging into a database, etc., is treated as a “heartbeat.” It means that the actor was at work, and available, and hence theoretically capable of replying to a message or performing some other action. To this end, the timestamps from all communication documents from each actor are extracted from the index and are placed into a new data store of the form: Actor_ID|reply_to_actor_ID|timestamp|item ID. This allows three very important measures to be taken:

-   The standard weekday (and non-holiday) distribution in time of each actor's communications to other actors indigenous to the corpus. For example, John Smith might be at work and on-line generally from 7:00 AM-4:00 PM, whereas for Jane Jones it might be from 10:00 AM-6:30 PM.
-   Absences of that actor, for example due to vacations or sick days; weekends and major holidays are automatically disregarded.
-   The mean response time to communications from each actor that they interact with a sufficient amount in order to be able to build a statistically significant model.

This derived data is also stored in a data store.

Warped Time: In a corporate or work context, the vast majority of communication among actors is likely to occur during or near “regular” business hours. In order to have a meaningful measure of mean response time to communications, the system must account for the fact that all actors have regular periods of inactivity, and that these periods may vary widely by actor. FIG. 22 b illustrates warped time. The system solves this problem by discounting hours that the actor was not active (Block 2290). For example, if Jane Jones logs off her system at 6:00 PM (Block 2282), and Bob Smith sends her an email at 8:00 PM that same night (Block 2284), if she logs back on at 9:15 AM the next day (Block 2286), and replies to Bob's email at 9:45 AM (Block 2288), only ½ hour of “warped” time will be considered to have elapsed. In another embodiment of the invention, this information is only considered on an average basis rather than on a per day one, except for those days which are of particular interest to the user. Note that messages to which there was no reply are not considered at this stage of the analysis.
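The warped-time calculation can be illustrated with the Jane Jones example. The sketch below assumes the actor's activity is available as a list of non-overlapping (log-on, log-off) intervals; that representation and the function name `warped_elapsed` are assumptions, not part of the described system.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def warped_elapsed(
    sent: datetime,
    replied: datetime,
    active: List[Tuple[datetime, datetime]],
) -> timedelta:
    """Sum only the time between sent and replied during which the actor was active."""
    total = timedelta(0)
    for start, end in active:
        # Clip each active interval to the window [sent, replied].
        lo = max(start, sent)
        hi = min(end, replied)
        if hi > lo:
            total += hi - lo
    return total
```

With Jane active until 6:00 PM on the first day and from 9:15 AM on the second, a message sent at 8:00 PM and answered at 9:45 AM the next day yields 30 minutes of warped time, while the wall-clock gap is 13 hours 45 minutes.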

The Importance of Communication Documents

Threading of email, or simulated threading of IMs across sessions, is considered a source of evidence. Both show evidence of a sustained interaction among particular actors. Such threaded material has a high probability of being semantically related. However, this is only one dimension of evidence, and can be refuted by other kinds of evidence in the discussion building process. This is to account for the fact that a user may, for example, hit the “reply to” button merely as a matter of convenience as opposed to with the goal of literally replying to the message.

The Problem of Clock Drift

Clock drift on a machine sending email causes Sent: and Date: timestamps to be incorrect. Clock drift on an SMTP relay machine (an intermediate machine) will cause transport header timestamp information to be incorrect; the mail server that contains a user mailbox is the last intermediate machine in a message delivery route. An email is produced by an email client at an origin machine. The origin machine creates a Sent or a Date header (usually when the Send button is pressed).

A wrong timestamp can invalidate numerous data, including the following: causal ordering of messages; reconstruction of reply chains of messages; average time before answer to an email. Therefore, the system endeavors to correct timestamps for clock drift before the portions of the model that depend on it are computed.

Note that computer clocks are often adjusted by their owner, especially the drifting ones. Therefore it is dangerous to extrapolate a drift rate from an isolated point in time. The technique described in FIG. 22 a does not rely on computing drift rates. Instead it relies on the fact that certain timestamps should be ordered in a certain way in order to respect causality of events. For this reason, the system does not consider clocks to be slow or fast, but rather early or late. Most clocks are correct, i.e. neither detectably early nor late. The quality of a clock being early, late, or correct is per-message. This has the benefit that a clock adjustment between two messages does not cause a problem.

The system attacks the clock drift problem by exploiting available SMTP information. Each time an email is routed and received by an intermediate server machine, a Received: header containing a timestamp is pre-pended to the message headers. Consider the last Received header in a message. This is the earliest Received header in causal order (independently of what the actual time stamp values say).

The system skips messages that do not contain both the Sent time header and the (last) Received time header. The remaining steps are repeated for each other message.

Part one: detect late clocks that stamp a Sent header:

If available, consider the second to last Received header and call it Received2. Note that if clocks were perfectly synchronized, then the order consistent with causality is Sent <= Received <= Received2. (The inequalities are non-strict because time resolution is only one second and so equality is possible.)

If Received >= Sent then the causal order is respected and no clock drift can be inferred. Skip this message.

Otherwise the reason causal order is apparently violated is a clock drift between the sender's client machine (Sent time) and the intermediate machine (Received time). The system assumes that there is a universal time that is followed approximately by most machines. In particular it is assumed that only one of Sent, Received, Received2 is affected by time drift.

If Received2 < Received then the intermediate machine Received is drifting (since one of Received and Sent is drifting and at most one of Sent, Received, Received2 is drifting). This drift is unimportant. Skip this message.

Else Sent > Received and in addition it is the origin machine (Sent) that is drifting. This is the case of interest to us since the Sent timestamp on the email must be corrected. We know the origin machine currently is late by at least (Sent - Received) time units. Set Sent_corrected := Received - 1. This is a timestamp consistent with causal ordering.
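Part one can be sketched as a small function over the three timestamps. The function name is hypothetical, and reading "Received - 1" as one second earlier is an assumption based on the one-second time resolution mentioned above.

```python
from datetime import datetime, timedelta
from typing import Optional


def correct_late_sent(
    sent: datetime,
    received: datetime,
    received2: Optional[datetime] = None,
) -> Optional[datetime]:
    """Return a corrected Sent timestamp, or None when no correction applies.

    received is the causally earliest Received header; received2 is the
    next one in causal order, when available.
    """
    if received >= sent:
        return None  # causal order respected, no drift can be inferred
    if received2 is not None and received2 < received:
        return None  # the intermediate machine is drifting; unimportant here
    # The origin machine is late by at least (sent - received).
    # Assumption: "Received - 1" means one second before Received.
    return received - timedelta(seconds=1)
```

A message stamped Sent 10:05:00 but first Received at 10:00:00, with a consistent second hop, would be given a corrected Sent time of 09:59:59.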

Part two: detect early clocks that stamp a Sent header.

Build partial email threads based on Message-Id/References links, as discussed elsewhere. Do not base any decision on the Sent: time header for now.

Do a depth-first traversal of the threads. Consider child-parent message pairs (a child is a reply to its parent). If clocks were perfectly synchronized then the order consistent with causality is Parent.Sent <= Parent.Received_1 <= ... <= Parent.Received_n < Child.Sent <= Child.Received_1 <= ... <= Child.Received_n2. If Parent.Received_n or Child.Sent is not available then skip the pair.

If Parent.Received_n >= Child.Sent then causal order is violated. In this case proceed with the following steps.

The reason causal order is apparently violated is a clock drift betweenthe last receiving machine for message Parent and the origin machine formessage Child. We use the other inequalities to determine which one iswrong.

If n>1 and Parent.Received_n>Parent.Received_n−1 then the intermediatemachine is drifting. We are not interested in this drift. Skip thismessage pair.

If n=1 and Parent.Received_n>Parent.Sent then this is a case alreadyhandled by Part one above. So do nothing.

Else Parent.Received machine is not drifting. We know the origin machinefor the Child message is early at least (Parent.Received_n−Child.Sent)time units. Set Child.Sent_corrected:=Parent.Received_n+1. This is atimestamp consistent with causal ordering.

Note that Parent.Received_n+Reaction_time=True_Child.Sent=Child.Sent+EarlyDrift (where EarlyDrift>0 and Reaction_time is the delay between the arrival of Parent in the recipient Actor's mailbox and the sending of the Child reply by this Actor). So in order for the violation Parent.Received_n>=Child.Sent to occur, we must have Reaction_time<=EarlyDrift. Otherwise no clock drift is detected.
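The relation between reaction time and detectable drift can be checked with a few numbers. The sketch below covers only the violation test and the correction step of part two; the intermediate-machine checks are deliberately left out, and all function names and sample values are hypothetical.

```python
def stamped_child_sent(parent_received_n: int, reaction_time: int,
                       early_drift: int) -> int:
    """Child.Sent as stamped by an origin clock that runs early.

    True_Child.Sent = Parent.Received_n + Reaction_time, and the stamped
    value is True_Child.Sent - EarlyDrift.
    """
    return parent_received_n + reaction_time - early_drift


def violates_causality(parent_received_n: int, child_sent: int) -> bool:
    """True when the reply appears to be sent no later than its parent arrived."""
    return parent_received_n >= child_sent


def corrected_child_sent(parent_received_n: int) -> int:
    """Smallest timestamp consistent with causal ordering."""
    return parent_received_n + 1


# Quick reply (30 s) from a clock that is 120 s early: drift is detected.
quick = stamped_child_sent(1000, reaction_time=30, early_drift=120)
print(violates_causality(1000, quick), corrected_child_sent(1000))

# Slow reply (300 s) from the same clock: the drift goes unnoticed.
slow = stamped_child_sent(1000, reaction_time=300, early_drift=120)
print(violates_causality(1000, slow))
```

The second case illustrates why only replies written faster than the clock error expose an early clock.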

Parts one and two are independent and can easily be computed together during a single depth-first walk. Time (via the Sent/Date email headers) is used elsewhere in the invention as a component of an email fingerprint for de-duplication purposes. In that context it does not matter whether the stamped time is precise. What matters is that it is unique (most of the time) and that it is transmitted unaltered. In one embodiment, the clock drift correction is not applied to the timestamp used for email fingerprinting.

Formal Workflow

Commercial workflow systems such as Intraspect can furnish the system with valid workflow sequences. Specifically, such a workflow system can provide the following information:

-   At the Nth step in the workflow process, the identities of the document or communication types that must have preceded it.
-   If the Nth step has occurred, similarly what the parameters are of the next step, if it occurred.
-   It may also possibly indicate which specific actors, or which job titles, are required to move the workflow to the next step.
-   Similarly, it may also indicate the date by which each subsequent step would have had to have occurred in order to be valid according to the rules specified in the system.

As previously noted, evidence provided by a workflow system is considered to be “hard” evidence. While evidence of some of the data involved in the workflow may no longer be present in the corpus, a placeholder item will be inserted for any item whose sequence number in a particular instance of the workflow process is less than the highest numbered item found associated with that instance.
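The placeholder rule can be illustrated with a short sketch. It assumes, purely for illustration, that workflow steps are numbered consecutively from 1; the function name is hypothetical.

```python
from typing import Iterable, List


def placeholder_steps(found_steps: Iterable[int]) -> List[int]:
    """Step numbers that need a placeholder item.

    A placeholder is needed for every step below the highest step found in
    this workflow instance that is itself missing from the corpus.
    """
    present = set(found_steps)
    if not present:
        return []
    highest = max(present)
    # Assumption: steps are numbered 1, 2, 3, ... without gaps.
    return [step for step in range(1, highest) if step not in present]


print(placeholder_steps([1, 4, 5]))
```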

Linguistic Evidence Sources

Different kinds of linguistic evidence can allow the system to expect a particular type of related information, or to expect that subsequent related items will have certain characteristics, thus making them easier to locate. These are as follows.

Ontology

The invention implements its own ontology 2128 representation geared specifically to representing relations between and constraints on terms or phrases. FIG. 21 b is a block diagram of one embodiment of the element relationships in a sample ontology. Ontology terms 2148, 2150, 2152, 2154, 2156, 2158 and classes 2130, 2132, 2134 can be mapped both to queries in the query engine 3120 implemented in the invention and to matching procedures that find terms and classes which describe an input text stream. The latter method is referred to throughout the patent as “matching an ontology class”. The ontology representation includes:

-   Terms (blocks 2148, 2150, 2152, 2154, 2156, 2158), which can be either individual tokens or phrases.
-   Classes (blocks 2130, 2132, 2134), which can have other classes or terms as children. A class can have multiple parents.
-   Relation links (block 2160), between any combination of class and term. Relation links can be either directed or undirected. A directed link is equivalent to a “TRIGGERS” link in a conventional ontology, and an undirected link is equivalent to a “SYNONYM” link in conventional ontology implementations.

Every term and class may have supporting clauses (blocks 2138, 2140, 2142) and refuting clauses (blocks 2144, 2146) attached (attachment 2136). These clauses are queries composed in the query language supported by the system. The function of support and refute clauses is not symmetric, however. Every term and class in the ontology 2128 inherits (see connector 2132) all support and refute clauses from its parent classes. For a term to match (in a stream) or be matched (in a query), the refuting clause must not be matched. However, a match on the supporting clause is not required. Supporting clauses are intended as a mechanism for improving the match score for terms that appear in a certain context or are associated with other query-able data in the system.
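The asymmetry between supporting and refuting clauses can be sketched as follows. This is a simplified illustration: clauses are reduced to single tokens rather than full queries, the clause lists are assumed to already include everything inherited from parent classes, and the scoring rule (one point for the term plus one per matched supporting clause) is an assumption, not the system's actual scoring.

```python
from typing import Iterable, Set


def match_score(tokens: Iterable[str], term: str,
                supports: Iterable[str] = (),
                refutes: Iterable[str] = ()) -> int:
    """Score a term against a tokenized text stream.

    Returns 0 when the term is absent or any refuting clause matches.
    Supporting clauses are optional and only raise the score.
    """
    seen: Set[str] = set(tokens)
    if term not in seen:
        return 0
    if any(clause in seen for clause in refutes):
        return 0  # a single refuting match vetoes the term
    return 1 + sum(1 for clause in supports if clause in seen)


text = "the java compiler reported an error".split()
print(match_score(text, "java", supports=["compiler"], refutes=["coffee"]))
```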

In one embodiment, there are no prohibitions on cycles appearing in the graph of relations on classes and terms in a particular ontology 2128. In one embodiment, algorithms on the ontology 2128 are required to detect cycles of links. Given the restricted set of relationships allowed in the ontology, cycles can be dealt with consistently (i.e. inheritance of clauses and relation links is monotonic).

Any class can be converted into a query of the form: (+(terms children related) supports -refutes), where terms is a list of clauses constructed for each term that is a direct child of the class, children is a list of clauses generated for each child class, and related is a list of clauses for each term or class accessible via a relation link originating at the class. Supports is a list of support clauses associated with the class and refutes is a list of refute clauses. The ‘+’ means that a match is required on at least one sub-clause in the initial list. The ‘-’ means that the query fails if any of the clauses in the refute list are matched.

Similarly a term can be converted into a query of the form: (+(terms related) supports -refutes). This represents a canonical form for the query; the generation algorithm can optimize queries to remove redundant clauses.
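A minimal sketch of generating the canonical query form for a class is given below. The `OntologyClass` structure and its field names are hypothetical, clauses are represented as plain strings, and the sketch assumes the child hierarchy is acyclic (no cycle detection is attempted).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OntologyClass:
    name: str
    terms: List[str] = field(default_factory=list)       # direct child terms
    children: List["OntologyClass"] = field(default_factory=list)
    related: List[str] = field(default_factory=list)     # via relation links
    supports: List[str] = field(default_factory=list)
    refutes: List[str] = field(default_factory=list)


def to_query(cls: OntologyClass) -> str:
    """Build "(+(terms children related) supports -refutes)" for a class."""
    required = list(cls.terms)
    required += [to_query(child) for child in cls.children]
    required += cls.related
    parts = ["+(" + " ".join(required) + ")"]
    parts += cls.supports
    parts += ["-" + clause for clause in cls.refutes]
    return "(" + " ".join(parts) + ")"


java_language = OntologyClass(
    name="java_language",
    terms=["java", "jvm"],
    supports=["compiler"],
    refutes=["coffee"],
)
print(to_query(java_language))
```

A real generator would also remove redundant clauses, as noted above; this sketch emits the unoptimized form only.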

The matching procedure is a bottom-up process that scans a text stream for occurrences of terms in the ontology 2128. If a term is found in the text stream, it only matches if no matches for any refuting clauses associated with that term are found. This is equivalent to the statement that the text stream would be returned in the result list for a query of the form (+term inherited_supports inherited_refutes) if it appeared in a document. In one embodiment, system components may utilize any of the following three methods to determine which classes will be output when a match is found:

-   Only output classes which have specifically been declared to be visible externally. When an ontology 2128 is defined, classes may optionally be labeled as “visible”. In this method, the list of visible classes that include a term as a member is output each time the term is matched.
-   For each term matched, return the most remote ancestor class that contains that term as a member. All terms in the closure of the class (i.e. terms directly in the class and terms directly in all reachable subclasses of the class) as well as related terms and closures of related classes are considered members of the class for this test. A class or term is related if a relation link exists from it to a class or term in the closure of the original class. If a term's text matches but the match is rejected by a refuting clause, then find the most remote ancestor that, if considered as the root of the class hierarchy (i.e. no clauses are inherited from the parents of that class), would allow the term to match. The most remote ancestor is the ancestor with the largest count of intermediate classes between itself and the term in question. If there are two or more classes with the maximum count, then a list of all such classes is output for the match. If the ontology 2128 contains a cycle then the last class that is traversed before the cycle is detectable is taken to be the most remote ancestor in the cycle.
-   Return the least remote class (in other words the most specific class) that contains as members all terms in a sequence of matches. The length of the sequence is determined by a parameter that is supplied when the search is initiated. The parameter specifies either that all matches in the text stream be considered as a sequence, or provides the length of the sequence to use. In the latter case, the sequence starting at each match in the document is evaluated. In other words, if the sequence length is three, then the sequence of matches (1, 2, 3) is tried, followed by (2, 3, 4), and all remaining three-match sequences.
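The enumeration of match sequences in the third method can be sketched as a sliding window. The function name is hypothetical, and returning no window when there are fewer matches than the requested length is an assumption.

```python
from typing import List, Optional, Sequence, Tuple


def match_windows(matches: Sequence[int],
                  length: Optional[int] = None) -> List[Tuple[int, ...]]:
    """Enumerate the match sequences to evaluate.

    With length=None all matches form one sequence; otherwise one window of
    `length` consecutive matches is produced starting at each match.
    """
    if length is None:
        return [tuple(matches)] if matches else []
    if length <= 0:
        raise ValueError("length must be positive")
    return [tuple(matches[i:i + length])
            for i in range(len(matches) - length + 1)]


print(match_windows([1, 2, 3, 4, 5], length=3))
```

Each returned window would then be tested for the most specific class containing all of its terms.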

This results in a hierarchy of term classes, which can trap the presence of different topics in a document. Each ontology class may contain supporting evidence for establishing its topic's presence in a particular document. It may also contain refuting evidence. Therefore ontology classes can easily distinguish between different senses of a term, for example, Java the coffee, versus the island, versus the programming language. One embodiment of the invention allows each term, class, relation, and support or refute clause to be weighted positively or negatively in order to influence the score for corpus items returned from a query as well as term matches in a document. The use of the query language enables tests against accrued evidence to be performed and thereby creates a powerful system for automating analysis of a corpus. Since in some embodiments and configurations ontologies may be used at various stages in the process, in one embodiment, the system implements consistency checks of the supporting and refuting clauses against the evidence available whenever an ontology 2128 is loaded and used.

Language Detection

A number of commercial and academic sources provide schemes for recognizing a natural (human) language based on a sample of text. In the current invention, such schemes are used in conjunction with the sociological analysis engine for the following purposes:

-   Personality spotting: inasmuch as a previously recognized actor is observed to use more than one human language beyond the use of isolated individual phrases, each language is assigned to a different personality 1220 for that actor, in the event that each such language usage corresponds to a different electronic identity 1225. (This last requirement is to prevent incorrect flaggings of, for example, multi-lingual employees who perform their job in more than one language. However, some embodiments of the invention may choose to make this trade-off.)
-   Cliques (“circles of trust 1505”): Cliques which substantially use a language other than the primary one used in the corpus can be optionally flagged as suspicious by the sociological engine; ontology hit scoring thresholds would be lowered to any single occurrence of a positive evidence term. This is due to the fact that foreign language use may be indicative of an attempt to obfuscate, to pass sensitive or restricted information along without risk of detection by any keyword-based filters that might be in place.

In addition to these purposes, the system also makes use of language recognition schemes to mark documents or regions of text for translation, if desired.

Author Fingerprinting

The system applies statistical natural language processing techniques in order to help determine authorship of documents lacking valid author metadata, or of textblocks which cannot be unambiguously assigned to an actor. Techniques used by different embodiments of the invention include, but are not limited to, the CUSUM method. The system can input verified samples of the writing of each actor in the actor set in order to increase the accuracy of these methods. The relevant information needed to fingerprint an actor is stored in the “lexical fingerprint” attribute, discussed above.

Quantitative Sociolinguistics

The heavy use of sociological data in the system lends itself to the addition of sociolinguistic data, in particular, notation of the discourse role relative to communication between any given set of actors. The discourse role includes the notions of the pragmatic function, linguistic register (i.e. the relative formality or informality of language used) and the intended audience of the communication. The primary implementation of sociolinguistic techniques in the current system applies to labeling the discourse role of a region of text with a pragmatic tag. This role must necessarily be considered contextually. For example, given a document containing a high percentage of sentence fragments and lacking formal openings and closings, most linguists would characterize such a document as ‘informal.’ However, in email communication, this may be an inaccurate categorization.

Pragmatic Tags

The present invention, in one embodiment, uses theory and methodology developed in the field of linguistic pragmatics and discourse analysis. Linguistic pragmatics seeks to explain the relation between language and context, including the structure of spoken or written discourse, the nature of references in communications, and the effect of context on the interpretation of language. While a substantial body of work exists on computational applications of linguistic pragmatics, the adaptation of such techniques to the sociological engine described here is a unique one. Specifically, pragmatic tags are used as a cue to the discussion-building algorithm and as a means of retrieving information within a coherent discussion that might be distributed across multiple text-containing items. FIG. 23 a describes the spectrum of pragmatic tags that may be available. The application of pragmatic techniques in this regard represents a method unique to the system of this patent.

The system initially assigns pragmatic tags 2301 to textblocks while analyzing documents during discussion building. In one embodiment, assigned tags 2301 are stored at document vertices in the graph and are available for later processing stages. An alternative embodiment of the invention may assign these tags 2301 in an earlier phase of processing and store the results. The pragmatic tag set described here is unique to the invention and defines the discourse properties both relevant to discussions and used for various analyses of constructed discussions. These pragmatic tags 2301 represent the discourse role distilled from examining the textblock as a whole, where a consistent role can be detected.

The tagging process uses an initial labeling pass, which labels text at a finer level of granularity with lower level tags 2301. Examination of the sequence of labels produced from this pass provides the strongest source of evidence for determining which pragmatic tag will eventually be assigned to the textblock. The invention uses currently practiced methods in computational linguistics, generally referred to as shallow parsing, to assign these lower level tags. The labeling process described below can take as input any tag set produced by a shallow parser so long as the tag set can represent a necessary set of distinctions described below. Automatic tagging presently available in shallow parsers is most reliable for labeling parts of speech (for example noun or verb), some syntactic constituents (for example noun phrase) and some syntactic relations (for example subject or object). One embodiment assumes the use of such tags. However, there are several tag sets in wide usage, such as the DAMSL (Discourse Annotation Markup System of Labeling) tag set, which also represent pragmatic and discourse roles for the labeled text. The use of a labeling method that provides such information may allow the system to produce more accurate results.

The pragmatic tag 2301 assigned to each textblock in a document is the most likely role given the sequence of tags and other forms of evidence found. One embodiment of the invention implements a maximum likelihood estimation approach to doing so, although alternative embodiments may use heuristic, rule-based or other approaches. The input tag set should at a minimum be able to distinguish between various forms of question, declarative and imperative sentence structure. Ideally, assigned tags should give some indication of whether anaphora are resolved or not. If not, some simple heuristics will be used to estimate whether unresolved anaphora may be present.

The assignment of pragmatic tags 2301 follows standard formal assumptions of linguistics, namely that actors provide information that is (1) relevant, (2) appropriate in quantity (that is, that all information introduced into the discussion can be assumed to fulfill illocutionary intent of the speaker) and (3) qualitatively appropriate (that is to say, that it is either true, or if false, the falsehood is due to illocutionary intent of the speaker, rather than linguistic incompetence). These three assumptions guide the system in classifying all available information and linguistic phenomena relative to its social context.

The following linguistic attributes are considered by the system in assigning a tag to a given document (or region of text):

-   discourse markers, lexical markers: The presence of words from closed sets of particles or discourse markers may indicate relevant details about information flow, indicating discourse events such as hedges (“on the other hand”), conventional openings, closings and politeness markers (“thanks,”) and linguistically relevant structural markers (“if,” “whether”). The system uses an ontology, as defined here for the purposes of the invention, to represent classes of lexical or discourse markers. The matching functionality is used to identify markers as each document 505 is processed.
-   personal pronoun usage: the target audience can often be inferred from the personal pronouns (or lack thereof) used in a document 505. “I” vs. “we” provides an important descriptor relative to speaker attitude and inclusion of other actors; “you” or “you guys” vs. indirect references to addressees indicates formality and/or breadth of target audience; and direct or indirect references to third parties (“he”, “they”) are also telling indicators of group relations.
-   syntactic structure: sentences containing unresolved anaphora, for example, indicate that an antecedent is required. (A message such as “Don't bother telling them”, with unresolved “them,” is obviously an addendum or response to an earlier question or statement.)
-   reported speech markers: an indication that “X told me . . . ” or “Y says . . . ” is of interest as it may indicate the role of a third party with an indirect effect on a discussion.

Another embodiment of the invention may take into account other evidence types in addition to linguistic attributes when assigning labels. These include, but are not limited to: document type, lexical similarity to textblocks in parent messages, textblock length, presence of quoted and other forms of shared content, and known revisions.

Core Pragmatic Tag Types

Initiator 2306: An initiator tag marks text that is intended to get a response from other actors (see block 2630). It typically introduces new content with little reference to older content. Typically initiators can be found at the beginning of a discussion, though when found in the middle of a discussion they may mark a change in topic, an attempt to reinvigorate a lagging discussion or a similar purpose. A number of linguistic and extra-linguistic factors in a document's structure serve as indicators. For example, lengthy documents containing a large number of complex noun and verb phrases typically indicate a drastic change in information flow. Therefore, such documents tend to be initiators.

Clarifier 2307: A clarifier (see block 2632), also commonly called a forward-looking statement, is somewhat more content rich but internally inconsistent; in other words, it contains a mixture of old and new information, indicating that both prior and subsequent information are required to resolve its content. In particular, clarifiers can indicate inclusion, exclusion or references to actors that can expand or restrict subsequent discussion flow.

Question 2308: While clarifiers are typically questions as well, the question tag indicates a class of questions that are simpler than clarifiers. Text segments with these tags tend to be shorter than clarifiers; they do not introduce new content, but rather refer to older content. They may also indicate an attempt to sustain a discussion.

Indeterminate 2309: This tag is for the fall-through case where no clear indication for one tag over any other can be found.

Filler 2310: A filler tag indicates a class of responses that do not contain any particular content. These responses are often formalities such as “Good job!”, “Thanks for sending that”, and so on. They tend to be short, do not introduce new material and are generally identified by the matching of discourse markers.

Response 2311: Shorter documents found to be lacking in distinct information can be tagged as responses (see block 2634). In general, “response” segments are relatively short and contain a number of statements with unresolved noun phrase co-references. This typically indicates that a prior document's content is needed to resolve the reference. In cases where noun phrase co-references are resolvable within a single short document (two or fewer paragraphs), it is still safe to assume that such a document will require additional outside context for complete intelligibility.

Summarizer 2312: As opposed to responses, summarizers tend to be richer in use of named entities and complex noun phrases (see block 2636). Summarizers do not introduce new content, and contain larger amounts of old content.

Null 2313: Certain classes of documents do not yield to pragmatic analysis, either due to the absence of clear linguistic structure (as in the case of a spreadsheet) or due to being so highly structured that subjecting them to analysis would introduce unacceptable levels of complexity into the system for which the pragmatic tags were originally intended, namely, ordering and retrieving related sets of documents. Such cases are eliminated from consideration by pragmatic tagging and receive the default ‘null’ tag.

Pragmatic tags 2301 play an important role in discussion construction, as they present data that indicates initiation, sustenance or conclusion of a discussion, point to missing information, and mark the emergence of topic drift. In the implementation described here, each tag is a tuple of 4 values. The first value, the tag type 2302, defines which of a set of broad classes (corresponding to linguistically relevant attributes) the tag belongs to. One example of the tag set that may be used for this embodiment appears in the section above. Alternative embodiments may adapt the tag set in one of several ways. A richer tag set could be used to break out finer levels of distinction along the same dimensions as those present in the core tag set. For example, there could be several levels of tags added between Initiator and Clarifier, representing finer gradations. Another direction that might be taken is to add further values to the pragmatic tag tuple to describe other characteristics of the tag.

The second value in a tag tuple represents the register 2303 of the textblock. This value can be one of ‘+’, ‘−’ or ‘*’, representing a formal, informal or undetermined register for the textblock respectively. Register 2303 can be detected by techniques including but not limited to examining the sentence composition of a textblock and pronoun usage as well as the use of specific lexical or discourse markers.

The third value in a tag tuple, the audience 2304, characterizes whether or not the communication is directed towards a particular audience (as opposed to, for instance, talking about someone in the third person). This value can be one of ‘+’, ‘−’ or ‘*’, indicating that the communication is directed, undirected or undetermined respectively. The directed nature of a communication can be detected by techniques including but not limited to the use of lexical or discourse markers and pronoun usage. Additionally a check can be made to see if a named person exists in the recipient list for a communication to indicate whether or not the communication is directed.

The fourth value in a tag tuple indicates the specificity of the speaker 2305. It measures whether the author is speaking on their own behalf or that of an organization or group of people. This value can be one of ‘+’, ‘−’ or ‘*’ for specific, plural or undetermined respectively. This value can be determined by techniques including but not limited to examining pronoun usage throughout the textblock.
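The four-value tag tuple can be represented compactly, as in the sketch below. The class and field names are hypothetical, the ASCII hyphen stands in for the ‘−’ value, and the comma-separated rendering simply mirrors the way tags are written elsewhere in this description.

```python
from dataclasses import dataclass

CORE_TAG_TYPES = ("Initiator", "Clarifier", "Question", "Indeterminate",
                  "Filler", "Response", "Summarizer", "Null")
MARKS = ("+", "-", "*")  # "*" means undetermined


@dataclass(frozen=True)
class PragmaticTag:
    tag_type: str        # one of the core tag types
    register: str = "*"  # "+" formal, "-" informal
    audience: str = "*"  # "+" directed, "-" undirected
    speaker: str = "*"   # "+" specific, "-" plural

    def __post_init__(self) -> None:
        if self.tag_type not in CORE_TAG_TYPES:
            raise ValueError(f"unknown tag type: {self.tag_type}")
        for value in (self.register, self.audience, self.speaker):
            if value not in MARKS:
                raise ValueError(f"invalid mark: {value}")

    def __str__(self) -> str:
        return ",".join((self.tag_type, self.register,
                         self.audience, self.speaker))


print(PragmaticTag("Initiator", "+", "-", "-"))
print(PragmaticTag("Null"))
```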

Some examples of initiators are:

EXAMPLE 1 An Email Containing Only the Following Text

Hey, gang. I just got back from Comdex and saw some of the new specialized speech recognition headsets that Acme was demonstrating. I think they'd be perfect for our automated transcription project. Do you think there's money left in the discretionary budget for these? If there is, who do I need to talk to about getting a purchase order?

EXAMPLE 2 A Long Document Formalizing a Company's Sexual Harassment Policy Could be Tagged as “Initiator,+,−,−”

Such a textblock would very likely be written with few or no first or second person pronouns, indicating formality and the intention of reaching a broad, nonspecific audience. Simple counts of the number of paragraphs and average number of sentences per paragraph suffice to identify the structure of most such documents 505 as formal. Finally, it can be assumed that any such policy statement will contain a relatively high count of named entities (company names, responsible parties, and the like) that would indicate its key role as an introductory or summary document 505 in an information flow.

The existence of tag sequences permits an additional query type for information retrieval. Specifically, documents that affirm or deny a previous statement or request, but are lacking in content themselves, can be located by users of the system by means of this set of tags. For example, a query of the type, “Did Lulu ever ask Fritz about the flux capacitor, and receive a response?” is possible through searches of content in conjunction with pragmatic tags 2301.

The core tag types can be thought of as points lying along a spectrum from Initiator 2306 at one end to Summarizer 2312 at the other. The tags 2301 at either extreme are considered more distinguished than those towards the center. As pragmatic tags 2301 will be used as positive evidence for the likelihood that an item is part of a discussion, the procedure for determining tags is set up so that any errors that occur tend to mislabel a more distinguished tag as a less distinguished one, for example Clarifier 2307 for Initiator 2306.

One embodiment of the labeling procedure proceeds as follows:

-   -   Items to be analyzed are first classified by type. In one        embodiment, only communication documents, communication events,        regular events, non-structured documents are considered for        non-Null tags. In other embodiments, all items except for        structure documents and other fielded documents are considered.        Additionally items that are found to be part of a formal or        informal workflow are assigned Null tags, as their role is        defined by their position in the workflow sequence (blocks 2318,        2319, 2328, 2320.)    -   Textblocks from the remaining items are selected for analysis        (block 2321.) The textblocks selected must be attributable to        the author of the document 505 (blocks 2322, 2327.) Shared        content, revision history and other constraints are used to        delineate each textblock in a document. Only those blocks not        attributable to another actor, or marked as quoted text (as in        an email communication for example) are considered attributable        to the author of the document.    -   Textblocks that are too large are assigned a Null tag of the        form “null,*,*,*” (blocks 2323, 2327.) The length threshold is        set to agree with a labeling policy used for labeling of a        training corpora as described below. In one embodiment, any        textblocks over this limit will automatically be assigned a null        tag. For example, such a limit could be set at 2 kilobytes of        characters of message content, following text extraction from        formatted mail types such as HTML. Alternative embodiments could        add other heuristics for making this determination. The issue of        length tends to be self limiting in that it will be less likely        that longer textblocks can be assigned a clear pragmatic tag        2301. 
Longer textblocks will tend to carry characteristics of        each of the core tag types and therefore the likelihood        calculated for each tag type will tend to the same value in the        scoring procedure mentioned below. Therefore longer        textblocks—those remaining after the initial removal of those        above the length threshold—will get an Indeterminate tag, above        some naturally occurring limit and it is safe to err towards a        higher limit.    -   For the remaining items which have not been assigned a Null tag,        perform the first pass of labeling using a shallow parser as        discussed above (block 2324.) Minimally only a unique tag        corresponding to each core tag type is required, though a richer        tag set with a finer breakdown of classes can be used. In        addition add a tag for each lexical or discourse marker        identified by matching against their respective ontology classes        (see ontology definition section above). If the tag set produced        by the shallow parser does not indicate unresolved anaphora        heuristics may also be used to decorate tags to indicate this.        The preferred embodiment simply counts backward from any        pronominal noun phrase looking for compatible noun phrases that        could serve as an antecedent (block 2325, and blocks 2346, 2347,        2348, 2349, 2350, 2351, 2352.) A compatible noun phrase meet        constraints such as agreement on person and number attributes        (block 2352.) Additionally pronominal noun phrases are skipped        over during the count back procedure. Optionally the count back        can be limited to the beginning of an enclosing paragraph, or to        a certain number of words. However, a more conservative strategy        of counting back till either a compatible noun phrase is found        or the beginning of the textblock is reached will make fewer        false positive errors.    
-   Construct a Markov chain (blocks 2326, and blocks 2357, 2358,        2359, 2360, 2361, 2362, 2363, 2364) for each of the possible tag        types using the state transition matrices created during the        training process described below. Labels from the textblock are        mapped to states in the Markov model in the same way as in the        training process (block 2361.) This yields a set of scores, one        for each potential tag type. If one tag is found to score        significantly higher than the others then that is initially        proposed as the tag type (block 2329) A typical technique for        determining this is to square each of the scores and threshold        the remaining values, say at half the value of the maximum        score. In other embodiments, other exponents and thresholds (as        determined by the previously described parameter estimation        process) can be chosen. If more than one score remains and the        core tag types for these scores fall in one half or the other of        the spectrum defined above then hypothesize the initial tag as        either Clarifier or Response respectively (locks 2329, 2330,        2335, 2337.) If there are several tags that are distributed on        both halves of the spectrum, then assign an Indeterminate tag        type (block 2336.) If there was no single winner, then apply        heuristics similar to those below. The exact heuristics will        depend on the properties of the shallow parser chosen. The        threshold used, or parameters obtained for other techniques, can        be obtained from standard practice parameter estimation        techniques used during the training process.    -   In the case where a clear winner was not established and all        high scoring tags fall into one half of the spectrum, the        following heuristics are used to finalize this hypothesis. 
    The quantity of old information, new information, and information content are estimated, in one embodiment using methods described elsewhere in the patent. The hypothesis is then modified as follows:
    -   if the hypothesis is Clarifier, and the textblock has no old information, adds new information and has a high information content, then move to Initiator (blocks 2331, 2332.) On the other hand, if the textblock has no new information, low information content, and either has some old information or unresolved anaphora, then move to Question (blocks 2333, 2334.)
    -   if the hypothesis is Response, and the textblock has no new information and high information content, then move to Summarizer (blocks 2338, 2339.) On the other hand, if it has low information content then move to Filler (see blocks 2340, 2334.)
-   Once the tag type has been determined, the remaining tuple values are calculated. In one embodiment Register is primarily determined by the presence of lexical or discourse markers (blocks 2341, 2369, 2370, 2371, 2372, 2373, 2374, 2375.) If a prevalence of formal or informal markers is found, the value will be assigned to ‘+’ or ‘−’ respectively (blocks 2371, 2372, 2374, 2375.) Additionally, indications of more complex sentence structure and the length of the textblock indicate a formal register (blocks 2373, 2375.) The exact heuristic depends on the tag set produced during shallow parsing, but will consist of checks for coordinate noun phrases, relative clauses and other complex syntactic constituents. For the preferred embodiment prevalence is defined to be >=66% or <=33%. One embodiment considers email and other communications to be informal by default and documents 505 to be formal by default (block 2369.)
-   In one embodiment the directed value is determined by examining the prevalence of 2^(nd) vs 3^(rd) person pronouns to assign ‘+’, ‘−’ values (blocks 2374, 2375, 2376, 2377, 2378, 2379, 2380, 2381.) A prevalence of 2^(nd) person pronouns indicates directedness, whereas a prevalence of 3^(rd) person pronouns indicates non-directedness (blocks 2374, 2375, 2378, 2379.) Additional heuristics may be applied, such as the presence of discourse markers representing directedness (blocks 2374, 2374, 2380, 2381.)
-   In one embodiment the speaker is determined by examining the prevalence of 1^(st) person singular vs 1^(st) person plural pronouns; in one embodiment, prevalence is determined as described above (blocks 2382, 2383, 2384, 2385, 2386, 2387, 2374, 2375, 2388.)
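The winner selection, hypothesis refinement, and prevalence test described in the list above can be sketched as follows. This is a minimal illustration and not the patented procedure itself: the grouping of core tag types into the two halves of the spectrum, the function names, and the boolean inputs for old information, new information and information content are all assumptions made for the example. The exponent of 2, the cutoff at half the maximum, and the 66%/33% prevalence limits come from the text.

```python
from typing import Dict, Optional, Tuple

# Assumed grouping of core tag types into the two halves of the spectrum.
CLARIFIER_HALF = {"Initiator", "Clarifier", "Question"}
RESPONSE_HALF = {"Response", "Summarizer", "Filler"}


def propose_tag(scores: Dict[str, float], exponent: float = 2.0,
                cutoff: float = 0.5) -> Tuple[str, bool]:
    """Return (tag, needs_refinement) from the per-tag-type chain scores."""
    powered = {tag: score ** exponent for tag, score in scores.items()}
    limit = cutoff * max(powered.values())
    survivors = {tag for tag, value in powered.items() if value >= limit}
    if len(survivors) == 1:
        return next(iter(survivors)), False   # a single clear winner
    if survivors <= CLARIFIER_HALF:
        return "Clarifier", True              # all survivors in one half
    if survivors <= RESPONSE_HALF:
        return "Response", True
    return "Indeterminate", False             # survivors span both halves


def refine(hypothesis: str, has_old: bool, has_new: bool,
           high_content: bool, unresolved_anaphora: bool = False) -> str:
    """Apply the Clarifier/Response refinement heuristics to a hypothesis."""
    if hypothesis == "Clarifier":
        if not has_old and has_new and high_content:
            return "Initiator"
        if not has_new and not high_content and (has_old or unresolved_anaphora):
            return "Question"
    elif hypothesis == "Response":
        if not has_new and high_content:
            return "Summarizer"
        if not high_content:
            return "Filler"
    return hypothesis


def prevalence_sign(plus_count: int, minus_count: int,
                    upper: float = 0.66, lower: float = 0.33) -> Optional[str]:
    """Return '+' or '-' when one marker class is prevalent, else None."""
    total = plus_count + minus_count
    if total == 0:
        return None
    share = plus_count / total
    if share >= upper:
        return "+"
    if share <= lower:
        return "-"
    return None  # no prevalence; the caller falls back to a default
```

Returning `None` when neither limit is reached is a design choice of the sketch; the text only says that defaults exist for register (informal for communications, formal for documents 505).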

The accuracy for determining register, directed, and speaker values can be enhanced by looking at syntactic relations if available. The use of Markov chains is widely understood standard practice; however, a brief explanation is recounted here.

The Markovian property applies for a series of events X if the conditional probability P(X_(n+1)=q_(i)|X_(n)=q_(j), X_(n−1)=q_(k), . . . )=P(X_(n+1)=q_(i)|X_(n)=q_(j)); in other words, the likelihood for the next state in the system is dependent only on the prior state. A large class of problems can be effectively solved by assuming this condition to be true. For the purposes of the invention, this assumption yields acceptable results. Define a set of states Q with a transition probability matrix {p_(ij) for all q_(i), q_(j) element Q, 0<=p_(ij)<=1}; in addition, each row of the matrix {p_(i,1), . . . , p_(i,|Q|)} must sum to 1. Calculation of the likelihood for each event in a chain is obtained by multiplying the likelihood obtained for the last event by the transition probability from the state representing the last event to the state representing the current event. Often the prior event's likelihood is multiplied by an additional scaling factor to minimize loss of precision.

One embodiment of the invention defines a procedure that maps the sequence of labels from the textblock into states of Q. For example, the shallow parser will likely create nested labeling structures if it marks both syntactic constituents and parts of speech. An example of this kind of labeling might be “ . . . [NP pro3s:he] . . . ”. In this case the mapping procedure can simply concatenate the tags to produce a label of the form “NP,pro3s” with a corresponding state in Q. Additionally, this procedure should filter out labels that are not deemed to be significant to categorization, as may be determined by a competent linguist during the training phase if desired.
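A minimal sketch of such a mapping procedure is shown here. It assumes that every label has the bracketed shape of the single example in the text, "[CONSTITUENT pos:token]"; a real shallow parser may emit a different format. The names `label_to_state`, `map_labels` and the `ignored` set are hypothetical.

```python
import re
from typing import FrozenSet, Iterable, List, Optional

# Assumed label shape, modeled on the example "[NP pro3s:he]".
LABEL = re.compile(r"\[(\w+)\s+(\w+):[^\]]*\]")


def label_to_state(label: str) -> Optional[str]:
    """Concatenate the constituent and part-of-speech tags, e.g. 'NP,pro3s'."""
    match = LABEL.fullmatch(label.strip())
    if match is None:
        return None  # not in the assumed format
    constituent, part_of_speech = match.groups()
    return f"{constituent},{part_of_speech}"


def map_labels(labels: Iterable[str],
               ignored: FrozenSet[str] = frozenset()) -> List[str]:
    """Map labels to states, dropping those judged insignificant."""
    states = []
    for label in labels:
        state = label_to_state(label)
        if state is not None and state not in ignored:
            states.append(state)
    return states
```

The `ignored` set stands in for the filtering decisions that the text leaves to a linguist during training.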

Other embodiments may use a higher order Markov chain. In an n-order Markov chain, the transition probabilities measure the likelihood of a state given the prior n states of the system. In any case the state space may be large, and standard techniques for representing sparse matrices may be used to implement the transition matrix. If necessary the matrix may be sparsified by using the technique of removing low probability transitions and assuming a standard lower bound during calculation when no transition is found between a pair of states.
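The scoring of a state sequence against a sparse first-order transition matrix can be sketched as below. Two choices here are assumptions of the example and are not taken from the text: the matrix is held as a dictionary keyed by state pairs, and log-probabilities are summed in place of the multiplicative scaling factor mentioned earlier (both approaches guard against loss of precision). The `floor` value for missing transitions is likewise an arbitrary placeholder.

```python
import math
from typing import Dict, Sequence, Tuple

Transitions = Dict[Tuple[str, str], float]


def chain_log_likelihood(states: Sequence[str], transitions: Transitions,
                         floor: float = 1e-6) -> float:
    """Sum log transition probabilities along the sequence of states."""
    log_like = 0.0
    for prev, cur in zip(states, states[1:]):
        # A missing entry means the transition was pruned or never seen,
        # so the assumed lower bound is used in its place.
        log_like += math.log(transitions.get((prev, cur), floor))
    return log_like


def best_tag_type(states: Sequence[str],
                  models: Dict[str, Transitions]) -> str:
    """Pick the tag type whose model scores the sequence highest."""
    return max(models, key=lambda tag: chain_log_likelihood(states, models[tag]))
```

For an n-order chain the dictionary key would become a tuple of n+1 states; the loop would then slide a window of that width over the sequence.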

Training requires a set of textblocks that have been labeled using the same shallow parsing system used in the above procedure, with discourse markers and other decorations added as above. Each block will be assigned a tag type by human evaluators. The described embodiment of the invention requires a separately trained model for each tag type. Training is a simple process of scanning through the tags in a textblock and applying the mapping procedure to obtain a sequence of states, then counting the occurrences of each pair of states (or n+1 tuple of states for an n-order model). To account for possible sparseness in the training set, the count for each possible pair of states may be initialized to 1, or a more sophisticated standard technique may be chosen. After all counts have been collected, normalize the rows of the resulting matrix so that they each sum to 1. Training may be done once on the basis of a large generic training corpus and then reused for the corpus to be analyzed by the invention or, for better results, on a training corpus with characteristics corresponding to the particular corpus to be analyzed, or on a sample drawn from the corpus to be analyzed. Training typically involves a verification phase, and parameter estimation can be done by attempting several iterations of training and verification in which parameter values are altered and using the parameter set that produces the best results.
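The counting and normalization steps for a first-order model can be sketched as follows. The function name and the dictionary representation of the matrix are assumptions of the example; the initialization of every pair count to 1 and the row normalization follow the description above. One such matrix would be trained per tag type, from the textblocks that human evaluators assigned to that type.

```python
from itertools import product
from typing import Dict, Iterable, Sequence, Tuple


def train_transitions(sequences: Iterable[Sequence[str]],
                      states: Sequence[str]) -> Dict[Tuple[str, str], float]:
    """Estimate a row-normalized transition matrix from state sequences."""
    # Start every possible pair at 1 to account for sparse training data.
    counts = {pair: 1.0 for pair in product(states, repeat=2)}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[(prev, cur)] += 1.0  # assumes every state is listed in `states`
    matrix = {}
    for prev in states:
        row_total = sum(counts[(prev, cur)] for cur in states)
        for cur in states:
            matrix[(prev, cur)] = counts[(prev, cur)] / row_total
    return matrix
```

Because the matrix is dense over `states`, a large state space would call for the sparse representation discussed earlier; the dense form is kept here only to make the normalization easy to read.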

Alternative embodiments may use other statistical classification schemes, such as Hidden Markov Models, that retain the essential characteristics of the system defined here, namely that tags are assigned to textblocks with appropriate characteristics as defined above and that classification errors result in labeling a textblock with a less distinguished label than desired.

Discussions

After the feeder elements for the discussion generation process have been extracted, derived, and/or calculated, the system is ready to begin building discussions. As previously noted, the goal of this process is not to include every item in the corpus in a discussion, but rather to reunite related series of events.

Specifically, a discussion can be defined as a (probabilistically) causally related set of heterogeneous items within a corpus. Discussions can be of arbitrary length and time span. Similarly, they can contain an arbitrary number of different topics, and have an arbitrary number of different actors associated with them. The only limitation in discussion construction is that there is adequate evidence to connect the (N+1)th element to the Nth. The methods for determining adequate evidence are detailed below.

In building discussions the system accrues evidence across a number of different dimensions in order to draw its conclusions about which documents and events should be woven into a discussion. Any individual piece of evidence may be refuted by other pieces of evidence. For example, the mere fact that an email message B was a “reply to” email message A does not by itself necessarily imply that A and B are related; the user may have hit the “reply” button merely as a convenience. If both messages have substantial lexical content that appears to be totally unrelated (apart from an unmodified quoted textblock), that would be refuting evidence against there being a causal relationship between A and B. Additional refuting evidence would be a manually performed change in the subject field, or message B having very similar lexical content to message Z, which preceded A in time and involved the same actors.

The above example notwithstanding, in one embodiment of the invention, the four strongest types of evidence are:

-   -   Actor presence
    -   Reply to or forward (of communication documents)
    -   Formal or informal workflow process
    -   The inclusion of either the same attachment or another version of the attachment, or one or more textblocks.

In one embodiment, the presence of at least one shared actor is an absolute requirement to link items together. However, an aggregate actor (such as a mail alias) is acceptable so long as once it is expanded, the correct individual actors are found. Date and warped time are also a strong supporting source of evidence.
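The shared-actor requirement, including the expansion of aggregate actors, might be checked along these lines. The sketch assumes that aggregate actors are supplied as a mapping from alias to member set and that aliases are not nested; both the mapping and the function names are hypothetical.

```python
from typing import Dict, Iterable, Set


def expand_actors(actors: Iterable[str],
                  aliases: Dict[str, Set[str]]) -> Set[str]:
    """Replace each aggregate actor (e.g. a mail alias) by its members."""
    expanded: Set[str] = set()
    for actor in actors:
        expanded |= aliases.get(actor, {actor})
    return expanded


def shares_actor(item_a_actors: Iterable[str], item_b_actors: Iterable[str],
                 aliases: Dict[str, Set[str]]) -> bool:
    """True when the two items have at least one individual actor in common."""
    return bool(expand_actors(item_a_actors, aliases)
                & expand_actors(item_b_actors, aliases))
```

In such a design the check would act as a gate: items failing it are never linked, whatever the other evidence says.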

Weaker forms of evidence include, but are not limited to:

-   -   Joining communications A and B, where B seems to require an antecedent. An example of this would be the case in which B had been pragmatically tagged as being an acknowledgment, and A had contained an attachment.
    -   Lexical content similarity
    -   A document modification or access event following communication about the document
    -   A communication event following a part of a discussion on the same topic

In order to further describe discussions at this juncture, the list below states the properties of discussions in one embodiment of the invention.

Discussion Properties:

Name: This is calculated from the most frequently occurring document title in the discussion. If there is a tie, one is selected randomly. In the event that a discussion contains no titled items, a title is selected by using the most frequently occurring sentence or phrase in the discussion. In addition, the user may rename individual discussions, or specify alternate naming heuristics using other discussion attributes. FIG. 24 a illustrates one embodiment of the relationship of discussions to other objects.

UID: This is generated by the system in order to uniquely identify each discussion. Any reliable UID generation scheme may be used.

Items: IDs of the set of items contained in the discussion.

Lifespan: This is calculated from the time and date stamp on the first chronological item in the discussion, and that of the last chronological item in the discussion 305.

Primary actor 2406: The actor, or in some cases actors, who drive the discussion. The primary actor 2406 will most often be the initiator of the discussion, and is specifically the person who generates both the most content and the greatest individual actor frequency of items in the discussion. In one embodiment of the invention, the amount of content generated is measured by sheer volume uniquely attributable to them as author, while in others it is only the content in so-called contentful documents that is considered. In one embodiment of the invention, if these last two tests indicate different actors, both actors will be considered primary actors 2406. The number of items generated by that actor will indicate which of the two actors is listed first when the discussion is referenced.

Contributor 2404: Actors who, while not the primary, generate significant content, specifically at least one contentful document. In one embodiment of the invention, a contentful document is one containing more than a sentence, and which does not correspond to a known template.

Participant 2403: An actor who has at least performed one action in the context of the discussion, but who is not responsible for the generation of any contentful document.

Observer 2405: An actor who has received information pertaining to the discussion, but for whom there is no evidence of any kind of response or other action.

Reviewer 2407: A user of the system who has seen, and possibly acted upon, the discussion; for example, has annotated it.

Topics: A topic is considered to be present in a discussion if a specific ontology class created to identify evidence of that topic is triggered. Alternately, in some embodiments, this may be done through statistical topic analysis, or a combination of these methods.

Has Deletions: In those cases where a discussion seems to be missing an item that once existed, this value is incremented by 1. Examples of this include, but are not limited to, a missing reply-to ID in an email, a missing early item in a workflow instance in which later items appear, references to named documents that can no longer be located, etc.

Revision: A discussion may have more than one version. This most commonly occurs when new data is made available to the system that results in a modification to the discussion, either an addition of new items, or a reclassification into another discussion of an existing item. In one embodiment, users may modify discussions. In this event, each modification is maintained as a separate numbered version. In one embodiment, any content change in the discussion results in the version number being increased by 1, whereas a change to an annotation results in the addition of 0.1. Note that in such embodiments, 10 annotation changes to the same major version would not cause a change in the major version number. Rather, the 11^(th) such change would increment the version by 0.01.

Annotations: A reviewer 2407 may annotate the discussion with arbitrary content, including links to other discussions.

Redact Info: If the discussion contains items that were redacted, this field contains a pointer to the redacted text, both content and (original) relative location in the document.

Pragmatic Tags 2301: List of the pragmatic tags 2301 appearing in the discussion.

Natural Languages: List of the spoken languages contained in the discussion. Note that isolated individual foreign language phrases or items appearing in quotes will not get counted. In one embodiment of the invention, more than 1 sentence in a language other than the primary one will be counted, but only if the content does not appear inside quotation marks.

Item Types: List of the item types appearing in the discussion. For example: Excel documents, emails, phone calls, etc.

Item Count: Total number of items contained in the discussion.

External Events 2104: List of external event IDs attached to the discussion.

External Documents 525: List of external document IDs attached to the discussion.

Related Discussions: List of discussion IDs that are either precursors to, or offspring of, the current discussion 305. In one embodiment, various similarity metrics may also be included.

Partitions 2401: List of partition markers in the discussion 305.

Pivotal Items 2402: List of pivotal items in the discussion 305.

Max Depth: The greater of the length of the longest partition 2401, and the length of the longest series of items on the same topic(s). Length in this instance is measured by the number of items. In other embodiments of the invention, it may be used either as a query parameter or in relevance ranking; a high max depth on a particular topic is an indicator that there is considerable detailed content relating to it.

Resolution Item 2411: In a discussion 305 which has achieved resolution, the item number containing it. Otherwise null.

Special applications may provide or require additional attributes for discussions 305. For example, whether a discussion 305 contained privileged material in a legal context.
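The actor roles listed among the discussion properties above (primary actor, contributor, participant, observer) can be derived from simple per-actor tallies. The sketch below is one possible reading under stated assumptions: the `ActorStats` fields, the use of a raw content volume, and the tie-breaking behavior of `max` are all hypothetical and are not specified by the text. Reviewers 2407 are omitted because they are users of the system, not actors tallied from the corpus.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ActorStats:
    content_volume: int   # assumed measure, e.g. characters uniquely authored
    items_authored: int   # number of items the actor generated
    contentful_docs: int  # documents longer than a sentence, not templates
    actions: int          # any recorded action within the discussion


def assign_roles(stats: Dict[str, ActorStats]) -> Tuple[Dict[str, str], List[str]]:
    """Return (role per actor, primary actors in listing order)."""
    most_content = max(stats, key=lambda a: stats[a].content_volume)
    most_items = max(stats, key=lambda a: stats[a].items_authored)
    # If the two tests disagree, both actors are primary; the one with
    # more items is listed first.
    primaries = sorted({most_content, most_items},
                       key=lambda a: stats[a].items_authored, reverse=True)
    roles: Dict[str, str] = {}
    for actor, s in stats.items():
        if actor in primaries:
            roles[actor] = "primary"
        elif s.contentful_docs >= 1:
            roles[actor] = "contributor"
        elif s.actions >= 1:
            roles[actor] = "participant"
        else:
            roles[actor] = "observer"
    return roles, primaries
```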

Colored Graph Representation

Embodiments of the invention will be described using a typed colored graph formalism for the sake of consistency and convenience. FIG. 7 a illustrates one embodiment of the colored graph setup, and FIG. 7 b provides a colored graph sample. The graph records relations between data items in the corpus as well as additional data items introduced during analysis of the corpus in order to describe derived evidence types and the structure of discussions. In alternative embodiments some of the relations and metadata described below may be represented in a more efficient and/or optimized data structure, such as an inverted index of the type used in information retrieval applications. The following processes and algorithms apply to any alternative embodiment that accumulates evidence which is equivalent to the set of relations and metadata described here.

The graph representation used here adds a type and color to each vertex in the graph as well as a color to each edge of the graph. In the graph implementation described below, each edge and vertex can store a set of data, defined by a metadata schema 718. Colors are simply a code used to identify the metadata schema associated with each vertex and edge in the graph. Additionally, the color associated with an edge identifies the type of relationship it defines between two data items. Vertices also have a type associated with them, which reflects the type of data item they represent (for example, Communication Document 510 or Regular Document 515). For vertices the metadata schema is fully specified via the combination of a color and a type. The consequence of this is that each data item in the corpus may be represented by several vertices in the graph, each with a different metadata schema. The different schemas are necessary for computing the various evidence types computed during the discussion building process, either to fully describe the computed attributes for each type of evidence or as intermediate data required to compute such evidence.

A typed colored graph CG consists of the pair (CV, CE), where CV is the set of colored vertices and CE the colored edges. C is a set of colors {c_(x): x is an integer} (see blocks 702, 708.) T is a set of types {t_(y): y is an integer} (see block 704.) D is a set of unique ids {d_(n): n an integer} which identify items in the corpus (see block 706.) CV is a set of colored vertices {v_(i)=c_(x)t_(y): i an integer, c_(x) element C and t_(y) element T} (see blocks 702, 704, 706.) CE is a set of colored edges {e_(j)=c_(x)v_(k)v_(i): c_(x) element C; v_(k) (head vertex) and v_(i) (tail vertex) are elements of CV} (see blocks 708, 710, 712.) Edges can be either directed or undirected, which is determined by the edge color (see below). In the implementation described all edges are navigable in either direction; the preferred direction of an edge is recorded as part of the edge key.
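The formalism above maps naturally onto a small in-memory structure, sketched here for illustration only. Attaching the corpus item id to each vertex is an assumption of the sketch (the definition of CV names only a color and a type, while the block references also cite the id set D), and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass(frozen=True)
class Vertex:
    color: str    # element of C, selects the metadata schema
    vtype: str    # element of T, the kind of data item
    item_id: str  # element of D (assumed to be carried on the vertex)


@dataclass(frozen=True)
class Edge:
    color: str    # element of C, names the relationship
    head: Vertex
    tail: Vertex
    directed: bool = True


@dataclass
class ColoredGraph:
    vertices: Set[Vertex] = field(default_factory=set)
    edges: Set[Edge] = field(default_factory=set)

    def add_edge(self, color: str, head: Vertex, tail: Vertex,
                 directed: bool = True) -> Edge:
        """Add an edge, registering its end points as vertices."""
        self.vertices.update((head, tail))
        edge = Edge(color, head, tail, directed)
        self.edges.add(edge)
        return edge

    def vertices_for_item(self, item_id: str) -> Set[Vertex]:
        """All vertices (one per color) that represent the same corpus item."""
        return {v for v in self.vertices if v.item_id == item_id}
```

A single email could thus appear once with color T for threading and once with color S for similarity, each vertex carrying its own metadata schema.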

The evidence accrual process consists of adding vertices and edges to a colored graph based on analysis of internal items in the corpus. Discussions are represented as an additional set of vertices and edges added during a decision process which takes as input this accrued evidence and determines which items are to be linked together into a particular discussion. Further edges are later added for relations between discussions and external items from the corpus as well as derived relations between discussions. The notations used are introduced below.

Vertex Types:

The colored graph used in one embodiment of the invention has the following vertex types (see block 704):

-   -   AR: Alias Record - name-address pairs extracted from documents 505. These pairs will be merged in order to construct actors 310.
    -   A: Actor - an actor 310 as defined earlier.
    -   P: Personality - an actor personality 1220 is identified via a distinct behavior pattern exhibited by an actor 310. (There may not always be enough evidence to recognize two apparent actors 310 as two personalities 1220 belonging to the same actor 310, or such evidence may only surface at a later point in an incrementally updated system.)
    -   NE: Named Entity - extracted named entities that do not resolve to any of the other defined data item types. Examples of named entities include, but are not limited to: document titles 505, and actor 310 names.
    -   CE: Communication Event 570 - The significance for discussion 305 building is that these events 570 will appear as part of an interactive discourse between two or more actors 310.
    -   RE: Regular (Internal) Event 2102 - These events 2102 may provide evidence for linking together other data items.
    -   EE: External Event 2104 - These only participate in relations built following discussion 305 building.
    -   C: Communication - The significance for discussion 305 building is that these events will appear as part of an interactive, turn based discourse between two actors 310. These items will most often provide the “backbone” for a discussion 305.
    -   RD: Regular Document 515 - These items most often appear as items related to items directly involved in a discussion 305.
    -   ED: External Document 525 - These items only participate in relations built following discussion 305 building.

In the implementation of the system these types may have some distinct subtypes. For example, Communication items consist of email messages, IMs, and so on. In practice this means that the methods for obtaining an item id may be different for each subtype, or there may be different rules for extracting metadata fields for the various metadata schemas associated with the type.

Vertex Colors:

The colored graph used in the preferred embodiment of the invention has the following vertex colors. Note that for implementation purposes of efficiency of sorting and retrieval of vertices, any of these colors may be replaced by a set of equivalent colors.

AP1, AP2: Actor Analysis - vertices with these colors store alias record data.

T: Threading - vertices with this color store data used for constructing relations that can be directly computed from the data item itself without inference (see blocks 736, 724, 738, 746.) For instance, email message ids specify direct reply and forwarding relationships. Another instance is individual IMs captured in a session log. Meta-data for this color includes cc: versus bcc: and to: information.

S: Similarity - vertices with this color store data used during computation of various sorts of similarity between non-fielded regular documents 515, which often require specialized secondary indexes to be built (see blocks 744, 740, 748, 752, 758, 762.)

R: Reference - this color is used for generic vertices (see blocks 756, 742, 750, 754, 760, 764, 2640, 2642, 2644.) These vertices do not need to store any specialized metadata.

D: Discussion - vertices coded with this color store metadata describing derived attributes of discussions 305 (see blocks 722, 726, 2638.) There is typically one such vertex at the root of a discussion 305.

X: Auxiliary Discussion - some vertices within a discussion 305 may be marked with this color to indicate specific features such as a splitting point or a pivotal message 2402.

Edge Colors:

Note: edges can be heterogeneous, that is they may link together nodes of different colors and types. A specification of the vertex types that can be linked via each edge type follows the description. Brackets, { }, are used to specify a set of possible types; single headed and double headed arrows, → ⇄, are used to indicate the preferred direction of an edge. The colored graph used in one embodiment of the invention has the following edge colors.

FT: “from-to” - Indicates that a communication was sent from the head identity to the tail identity.

-   -   Type Specification: AP1.AR→AP1.AR

TT: “to-to” - A relation established between recipients of the same mail. Provides negative evidence towards merging the two recipient alias records.

-   -   Type Specification: AP1.AR⇄AP1.AR

CA: Cluster Alias Records - Edges used during merging of alias records to find actor personalities 1220.

-   -   Type Specification: AP2.AR⇄AP2.AR

T: Threading - Indicate direct, rather than inferred, links between items (see block 730.) These edges are added prior to the process of discussion 305 building which discovers additional causal relationships between items.

-   -   Type Specification: T.C→T.C, or
    -   T.CE→T.CE

AT: Attachment - Simply records attachment relationships between messages and their attachments, which are stored as separate data items in the corpus.

-   -   Type Specification: T.C→{T.RD, T.C}

HR: Hard Revision - Records direct known relationships between versions of a document 505, which have been extracted from an external source such as a source control system.

-   -   Type Specification: R.RD→R.RD

CR: Clustered Revision - Links between revisions of a document 505 identified through the document lineage assessment process.

-   -   Type Specification: S.RD⇄S.RD

TM: Template - Links between a set of distinct documents 505 that are derived from one base version.

-   -   Type Specification: S.RD⇄S.RD

TC: Topic Cluster - Links between items that both contain text referring to the same topic 315 (see block 734.) Derived via standard practice topic 315 clustering approaches.

-   -   Type Specification: {R.CE, R.RE, R.C, R.RD}→{R.CE, R.RE, R.C, R.RD}

SC: Shared Content - Links together data items sharing common textblocks (see block 732.) This can occur when material is cut and pasted from one item to the other, or when a reply message quotes material from its parent.

-   -   Type Specification: {S.CE, S.RE, S.C, S.RD}⇄{S.CE, S.RE, S.C, S.RD}

SD: Shared Document 505 - Links together items related to a common document 505 in some way, either via direct attachment, references to the document 505 in the content of both items, and so on.

-   -   Type Specification: {S.CE, S.RE, S.C, S.RD}⇄{S.CE, S.RE, S.C, S.RD}

SR: Shared Reference - Links together data items with any other shared references to named entities that have been extracted during preprocessing.

-   -   Type Specification: {S.CE, S.RE, S.C, S.RD}⇄{S.CE, S.RE, S.C, S.RD}

CS: Content Similarity - Links together data items with similar lexical content. This measure is based on the amount of similar material between documents 505. For example, it can be used to derive a rough measure of the amount of new information introduced in one item relative to an earlier item.

-   -   Type Specification: {S.CE, S.RE, S.C, S.RD}⇄{S.CE, S.RE, S.C, S.RD}

TR: Trigger - Presence of a link indicates that an event has been found to trigger a subsequent communication in some way.

-   -   Type Specification: {R.CE, R.RE}→{R.C}

R: Reference - Links an item to named entities that can be resolved to other items in the corpus.

-   -   Type Specification: {R.CE, R.RE, R.C, R.RD}→{R.NE, R.CE, R.RE, R.RD}

CL: Clique - Presence of a link indicates that two personalities 1220 are part of the same social clique, also called a circle of trust 1505.

-   -   Type Specification: R.A⇄R.A

W: Workflow 2409 - Presence of a link indicates that two items represent successive steps in a formally defined workflow 2409.

-   -   Type Specification: {R.CE, R.RE, R.C, R.RD}→{R.CE, R.RE, R.C, R.RD}

AW: Ad hoc Workflow 2409 - Presence of a link indicates that two items represent successive steps in a sequence discovered in patterns of communications in the corpus.

-   -   Type Specification: {R.CE, R.RE, R.C, R.RD}→{R.CE, R.RE, R.C, R.RD}

GB: Global Burst - Links two items from the same author with a time delay shorter than the average frequency, relative to a time window, computed for that author. In practical implementations, once the frequencies over various windows have been computed for an actor 310, whether or not a pair of items are part of a burst of activity is easily computed on the fly; however, the link type is described here for sake of consistency.

-   -   Type Specification: R.C→R.C

PB: Pairwise Burst—As for GB, but computed between pairs of actors 310.

-   -   Type Specification: R.C→R.C

CO: Collaborators - Presence of a link indicates that two actors commonly collaborate on documents.

-   -   Type Specification: R.A⇄R.A

D: Discussion - Links items that are members of a discussion 305 (see block 728.) Contains a vertex id for the root vertex of a discussion 305.

-   -   Type Specification: {D.CE, D.RE, D.C, D.RD}→{R.CE, R.RE, R.C, R.RD}, or
    -   {R.CE, R.RE, R.C, R.RD}→{R.CE, R.RE, R.C, R.RD}

AUX: Auxiliary Discussion - Links items that should be included in views of a discussion 305 but are not available as attachment points for later items in the discussion 305.

-   -   Type Specification: {R.CE, R.RE, R.C, R.RD}→{R.CE, R.RE, R.C, R.RD}

CTX: Context - Links external to internal items. These links are constructed in the post discussion 305 building analysis phase.

-   -   Type Specification: {R.EE, R.ED}→{D.CE, D.RE, D.C, D.RD, R.CE, R.RE, R.C, R.RD}

Colored Graph Implementation

In one embodiment, the colored graph is implemented as two persistent lookup tables, one for vertices (see block 714) and one for edges (see block 716.) Vertices are related to items via a unique index computed for each item. A vertex is uniquely specified by its color and id. Thus an item can be related to multiple vertices in the graph, each with a different color. The interpretation of data stored in the vertices table entry is determined by the vertex's color. The item id is also used to refer to the item in auxiliary data stores such as inverted indexes. For the sake of efficiency and to improve the locality of keys in the vertex and edge tables, data item ids encode their type and a time associated with the item. These keys are sorted lexicographically first by type, then time and then on any other attributes encoded in the id. Edges are uniquely specified by their color, and the vertex keys of their head and tail vertices. As with vertices, the interpretation of data stored in the edge table is determined by the edge color.

The tables support iterations over keys in sorted order. Iterations can be specified over results of a range query 320. For sake of efficiency most evidence computation is implemented as a linear scan over edges in key sorted order where possible. A notable exception is an embodiment of the final discussion 305 building phase, which must traverse various trees in the graph.

Vertex keys consist of the concatenation of color and unique id. Thus vertices of a given color and type can be retrieved via a single range query 320 against the persistent data store.

Edge keys consist of the concatenation of edge color, head and tail vertex ids and a flag indicating that the edge is directed or undirected. When an edge is added to the graph a reverse edge is also added to ensure navigability in both directions.
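The key layout described above can be illustrated with an in-memory stand-in for the persistent tables. Everything concrete in this sketch is an assumption: the "|" separator, the ISO 8601 timestamp text, the "D"/"R"/"U" direction flags, and the `SortedTable` class, which uses a sorted Python list where a real system would use a persistent ordered store.

```python
import bisect
from typing import List

SEP = "|"  # assumed field separator; must not occur inside any field


def item_id(vtype: str, timestamp: str, serial: str) -> str:
    """Item id that sorts by type, then time (ISO 8601 text), then serial."""
    return SEP.join((vtype, timestamp, serial))


def vertex_key(color: str, item: str) -> str:
    """Vertex key: color concatenated with the unique item id."""
    return color + SEP + item


def edge_keys(color: str, head: str, tail: str, directed: bool) -> List[str]:
    """Forward key plus the reverse key that keeps the edge navigable both ways."""
    if directed:
        return [SEP.join((color, head, tail, "D")),
                SEP.join((color, tail, head, "R"))]
    return [SEP.join((color, head, tail, "U")),
            SEP.join((color, tail, head, "U"))]


class SortedTable:
    """Keys kept in lexicographic order so prefix scans are range queries."""

    def __init__(self) -> None:
        self._keys: List[str] = []

    def put(self, key: str) -> None:
        index = bisect.bisect_left(self._keys, key)
        if index == len(self._keys) or self._keys[index] != key:
            self._keys.insert(index, key)

    def scan_prefix(self, prefix: str) -> List[str]:
        """Return all keys starting with prefix, in sorted order."""
        lo = bisect.bisect_left(self._keys, prefix)
        hi = bisect.bisect_left(self._keys, prefix + "\uffff")
        return self._keys[lo:hi]
```

Because the type and time lead the item id, a scan over the prefix for one color and type yields the matching vertices in chronological order, which is what makes the linear scans mentioned above cheap.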

Actor Presence

The set of actors present in a discussion at each interaction defines the actor presence model 2654. FIG. 26 a illustrates one example of an actor presence model evolution. A discussion is defined by the participation of a particular set of actors over time. This set of actors may evolve as actors are added to or dropped from the ongoing discussion. Actor presence 2654 is modeled for the purpose of calculating the likelihood that a particular item is part of the discussion. Additionally, for purposes of reference and of querying, the presence model is stored as metadata on discussion vertices.

The actor presence model defines several levels of participation: primary actors 2406, contributors 2404, participants 2403, and observers 2405. A discussion may have an arbitrary number of actors associated with it, not all of whom are equally important. In fact, some actors may be totally passive throughout the discussion. In such cases, there may not even be any evidence that they ever opened any of the documents associated with the discussion.

In addition, some applications of the invention have the notion of a“viewer” or “reviewer” 2407. This is any user who has viewed individualdiscussions 305 or documents. This user may, or may not correspond to anactor represented in the corpus. Rather this is a mechanism ofmaintaining an audit trail on the retrieval of information in thesystem. To this end, each time a discussion or individual document isviewed or otherwise accessed, its “viewed” count (or other access count)is incremented by 1. A separate viewing log is kept, with a basic recordstructure of: (User ID, Discussion Viewed, Timestamp.)

This audit trail allows the system to track not only which reviewers 2407 accessed which discussions 305, but also the relative importance, or popularity among reviewers 2407, of discussions 305. Records are kept of items including (but not limited to) the number of queries 320 that resulted in a particular discussion 305 being retrieved; the number of drill-down examinations of each discussion 305 (for example, accessing its timeline); and the number of times the discussion 305 was sent as an attachment.
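As an illustration of the audit trail described above, the following sketch keeps a viewing log with the (User ID, Discussion Viewed, Timestamp) record structure and per-discussion access counts. The class names, the `kind` labels, and the use of UTC timestamps are assumptions, not details taken from the text.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ViewRecord:
    """One viewing-log entry: (User ID, Discussion Viewed, Timestamp)."""
    user_id: str
    discussion_id: str
    timestamp: datetime


class AuditTrail:
    """Hypothetical container for the viewing log and the access counters."""

    def __init__(self):
        self.log = []
        # Keyed by (discussion id, kind of access); "view" is the basic count.
        self.counts = Counter()

    def record_access(self, user_id, discussion_id, kind="view", when=None):
        """Increment the relevant access count by 1 and append a log record."""
        when = when or datetime.now(timezone.utc)
        self.counts[(discussion_id, kind)] += 1
        self.log.append(ViewRecord(user_id, discussion_id, when))

    def popularity(self, discussion_id):
        """Total accesses of any kind, as a rough popularity measure."""
        return sum(n for (d, _), n in self.counts.items() if d == discussion_id)
```

Other access kinds mentioned in the text, such as query retrievals or drill-down examinations, would simply be recorded under different `kind` labels.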

In embodiments allowing modification of the information, a new version of the discussion 305 or document 505 is created; annotation information is added to the log record where present, as are deletions (where permitted).

Modeling Actor Presence in a Discussion

FIG. 26 a is a diagram of one embodiment of an actor presence model evolution. The primary actors 2406, contributors 2404, participants 2403 and observers 2405 fields stored as discussion metadata collectively define the actor presence model 2654 (see blocks 2646, 2648, 2650, 2652). The presence model 2654 measures the level to which each member is intrinsic to a discussion. In one embodiment of the invention, during the construction of discussions, candidate data items are evaluated for consistency with the current actor presence model 2654 when deciding whether or not the item should be linked into the discussion. For example, the probability that a primary actor 2406 may drop from a discussion is much lower than for an observer 2405. Since actor presence can evolve throughout the lifetime of a discussion (for example, actors may be added or dropped at various points), the presence model 2654 is not particularly amenable to clustering approaches. An alternate embodiment of the presence model can be defined using membership levels in place of the ranking induced by the four categories mentioned above. One embodiment of the invention uses ranking to simplify integration of heuristic and rule-based methods for determining actor presence.

The following section describes a general procedure for constructing the presence model given a particular tree of sequentially related data items. In one embodiment, this procedure is generalized during discussion construction, as the problem becomes one of finding a set of causal relations that creates a consistent presence model.

With reference to FIG. 26 a, the status of an actor may change with each successive message. The pragmatic tags 2630, 2632, 2634, 2636 listed on the left demonstrate a very typical tag sequence: “Anthony Mavis” initiates the discussion, starting with a question; “Brad Norstrom” asks for a clarification while inviting “Elvira Quent” into the discussion; “Elvira Quent” gives a response which presumably answers the question; and “Carol Ogilvey” confirms this by thanking “Elvira Quent”.

A simplifying assumption is made that participants 2403 in a discussion generally view items in the time order that they arrive. Therefore the evolution of the presence model can be determined by sequentially ordering all items in a discussion 305 by time. This means that the presence model for a discussion 305 at any given point in time is the presence model as calculated for the latest member of the discussion 305. In some embodiments, the discussion 305 may have several branches that remain contained inside it (as opposed to an offshoot) only if the presence model remains consistent. If two branches develop in a discussion that each have a different subset of actors, the parent discussion 305 has split into two new discussions. As noted elsewhere, two discussions 305 may merge if an item is strongly related to other items in the same discussion 305, but if the actor presence model changes, the system will start a new discussion 305 at the point where the two merge. When this occurs, the discussion 305 vertex will be marked so that the possible merging of the parent discussions 305 will be recorded and can be queried by the system.

The following types of evidence can be used to evaluate changes in presence level for the author of each new item as it is evaluated:

-   pragmatic tags (R colored vertex metadata)
-   membership in communication time burst clusters (GB and PB colored links)
-   membership in actor cliques or “circles of trust” 1505 (CL colored links)
-   membership in a collaboration cluster (CO colored links)
-   job title/role and reporting relationships (A colored vertex metadata)
-   frequency of participation (D colored vertex metadata)

Alternative embodiments may add other evidence types as well as additional heuristics to those mentioned below.

For the purposes of this description, the starting point is assumed to be a list of Communication documents (C) 510 and Communication Event (CE) 570 data items. The construction method consists of iterating through the set of data items in time order. The actors 310 associated with each item are retrieved and assigned a level of presence. Actors 310 not already in the presence model are added with a frequency count of 0; actors 310 already in the model are moved to the appropriate rank set if necessary, and their frequency count is incremented.
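The iteration just described can be sketched as follows. This is a simplified illustration: the `assign_rank` callback is a hypothetical stand-in for the evidence-based rank assignment, items are assumed to be `(timestamp, actors)` pairs, and the frequency handling follows a literal reading of the paragraph above (a new actor starts at 0, a returning actor is incremented).

```python
RANKS = ("primary", "contributor", "participant", "observer")


def build_presence(items, assign_rank):
    """Walk items in time order and maintain rank sets plus frequency counts.

    items: iterable of (timestamp, [actor, ...]) pairs (assumed shape).
    assign_rank: callable(actor, item) -> one of RANKS (hypothetical hook).
    """
    rank_of = {}
    freq = {}
    for item in sorted(items, key=lambda it: it[0]):
        for actor in item[1]:
            if actor not in freq:
                freq[actor] = 0          # newly seen actor starts at 0
            else:
                freq[actor] += 1         # returning actor is incremented
            rank_of[actor] = assign_rank(actor, item)  # move to new rank set
    model = {rank: set() for rank in RANKS}
    for actor, rank in rank_of.items():
        model[rank].add(actor)
    return model, freq


def author_first(actor, item):
    """Toy rule for the example only: the first listed actor is primary."""
    return "primary" if actor == item[1][0] else "observer"


example_items = [(2, ["brad", "anthony"]), (1, ["anthony", "brad"])]
example_model, example_freq = build_presence(example_items, author_first)
```

In a fuller implementation the rank decision would draw on the evidence types listed earlier rather than on a positional rule.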

Rank assignment is done on the basis of evidence accumulated during a traversal of items in the set. Most of this evidence can be maintained as a simple set of tallies. For shorter discussions 305 it is sufficient to sum tallies since, for example, evidence for primary involvement early in a discussion 305 should be sufficient for establishing the actor 310 as a primary actor 2406 overall. However, when confronted with longer discussions 305 or a need for incremental building of discussions 305, a tally computed over a sliding time window is more appropriate. For the sake of efficiency, in one embodiment, these tallies can be approximated by:

\[
\begin{cases}
\dfrac{t_i}{w}\,x_i + \dfrac{w - t_i}{w}\,x_{i-1}, & \text{if } t_i \le w \\[1ex]
\dfrac{w}{t_i}\,x_i, & \text{otherwise}
\end{cases}
\]

where $t_i$ is the time interval from the current item to the last item, $w$ is the length of the window, and $x_i$ is the quantity to be tallied.

Note: the time interval will be adjusted to account for explained actor 310 absences and regular activity patterns, as described for other time based calculations used in the invention.
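The approximation can be written directly as a small function. The name `windowed_tally` and the reading of the previous-index term as the prior value of the tallied quantity are assumptions; the arithmetic itself follows the two cases stated above.

```python
def windowed_tally(t_i: float, w: float, x_i: float, x_prev: float) -> float:
    """Approximate a sliding-window tally from the current and previous values.

    t_i: time interval from the current item to the last item.
    w: length of the window.
    x_i: quantity being tallied for the current item.
    x_prev: the previous-index quantity (interpretation is an assumption).
    """
    if w <= 0:
        raise ValueError("window length must be positive")
    if t_i < 0:
        raise ValueError("time interval must be non-negative")
    if t_i <= w:
        # Blend the current and previous values in proportion to the interval.
        return (t_i / w) * x_i + ((w - t_i) / w) * x_prev
    # The interval exceeds the window, so the current value is scaled down.
    return (w / t_i) * x_i
```

For example, with a window of 10 and an interval of 5, the current and previous values are weighted equally; with an interval of 20, the current value is halved and the previous value is ignored.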

The tallies recorded may include, but are not limited to:

-   count of communications sent, events that trigger other communications (these events include revisions on a shared document 505, events 1525 marked as triggers for a communication, etc.)
-   distribution of pragmatic tags, i.e. tallies of initiator, clarifier, question, filler, response, and summarizer tags
-   count of communications that are part of a pairwise burst between author and a recipient without being part of a global burst for the author.

The Discussion Building Process

Two embodiments of the invention are described for discussion building. Each embodiment derives several types of evidence that are used to join sets of items into a discussion 305. Each of these embodiments describes an initial construction phase, which yields clusters of items with strong evidence for membership in the discussion 305. Following this phase, another pass is performed which refines these clusters and adds discussion 305 links.

Phase 1: Construct Initial Discussion Clusters

FIG. 24 b is a flowchart of one embodiment of constructing the initial discussion clusters. In one embodiment of the invention, discussion clusters will be constructed and recorded by placing hypothesized discussion edges in the graph. Initially a clustering process is performed on the actor sub-graph (block 2412). Since the discussion building process may provide new evidence that can be used to further correct the actor graph, the actor graph will be rebuilt after discussions have been built. This clustering pass provides an approximate set of actors who interact the most frequently and substantively with one another.

In addition to the previous passes of analysis performed, such as computation of near duplicates and textblock identification, passes of clustering over indigenous corpus items are performed. These involve well-understood clustering techniques and are therefore not elaborated upon here. Topics are identified using topic clustering. The clusters are recorded in the colored graph with TC-colored links (block 2413). Clusters of items with similar lexical properties are identified, for example via number of shared frequent lexical items or shared frequent collocations. These clusters are recorded with CS-colored links (block 2414). Sequences of documents which are part of formal workflows are identified and the sequences stored with W-colored edges (block 2415). As described elsewhere in the patent, ad hoc workflows are computed and recorded with AW-colored edges (block 2416). Note that in one embodiment of the invention, the computation of AW-colored edges requires an initial pass of discussion building in order to accrue a sufficient amount of evidence. In continuous or incremental versions of the system, this can be done after the initial processing of the corpus, and then re-computed at the same interval as structure documents are re-evaluated, or at some constant multiplied by this interval, since ad hoc workflow processes do not change frequently. In forensic versions of the system, an optional second pass on the corpus may be performed in order to include AW-colored edges as an evidence source.

In one embodiment, extraction of named entities (block 2417) is based on a mixture of techniques, including searching for known named entities that have already been identified. Known standard patterns, including but not limited to URLs, email addresses, host names, and IP addresses, are identified using regular expressions. A shallow parser (as previously described for pragmatic tagging) may be employed to identify potential named entities such as proper noun phrases. If a shallow parser is invoked, perform pragmatic tagging (though some embodiments may opt to delay this until the final phase of discussion building, as only the items found to be in a potential discussion cluster need be examined). All known and suspected named entities denoting specific documents or resources, such as URIs, are placed in an inverted index. Remaining known and suspected named entities are placed in a separate inverted index. (A suspected named entity would be one or more contiguous tokens that had been pragmatically tagged as an entity, but an otherwise unknown one; for example, an unknown actor 310 name.) Clustering is performed over the first index in order to build SD-colored edges for documents 505 that share references to external documents 525. If any of the referenced documents 505 can be found within the corpus, R-colored edges may be added to the items containing the document 505 reference. Clustering is performed over the second index in order to build SR-colored links for groups of items sharing common references. (The preceding paragraph corresponds to block 2417.)
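The pattern-based part of this step can be sketched as follows. The regular expressions are deliberately simple illustrations rather than complete grammars, the function names are hypothetical, and the pairwise grouping in `shared_reference_pairs` is a stand-in for whatever clustering an embodiment actually applies over each inverted index.

```python
import re
from collections import defaultdict

# Illustrative patterns only; real URL, address and IP grammars are broader.
PATTERNS = {
    "url": re.compile(r"https?://[^\s<>\"']+"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}
RESOURCE_KINDS = {"url"}  # entities that denote specific documents or resources


def index_items(items):
    """Build two inverted indexes: resource references and other entities.

    items: dict mapping item id -> text (assumed shape).
    """
    resource_index = defaultdict(set)
    entity_index = defaultdict(set)
    for item_id, text in items.items():
        for kind, pattern in PATTERNS.items():
            target = resource_index if kind in RESOURCE_KINDS else entity_index
            for match in pattern.findall(text):
                target[match].add(item_id)
    return resource_index, entity_index


def shared_reference_pairs(index):
    """Return item pairs that share at least one indexed entity."""
    pairs = set()
    for ids in index.values():
        ordered = sorted(ids)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs
```

Pairs drawn from the first index would be candidates for SD-colored edges, and pairs from the second index candidates for SR-colored links.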

Literal communication threads such as reply-to, forward, newsgroup postings, and the like, as well as workflows 2409, shared attachments, and shared content (e.g. textblocks and quotations shared between documents 505) will be used to provide a “backbone” that potential discussion 305 clusters will build on. As most discussions 305 will be initiated and sustained through communications, start by identifying thread “clusters” (see blocks 2418, 2419, 2420, 2423). Note that some of this information, such as membership in strict email threads, can be constructed through means other than clustering analysis. This is not necessarily the same sense of cluster as for the other clusterings described here. Since thread clusters are well ordered, create a tentative D-colored vertex at the root of the thread cluster (see block 2421). Walk the tree of T-colored edges and create a corresponding tentative D-colored discussion 305 edge (see block 2422). Tentative discussion 305 edges are decorated with a flag that gets set when the edge is either finalized in phase 2 or removed. Similarly, find all W-colored clusters and, if the root does not already have an outgoing D-colored edge, create a tentative D-colored vertex and add edges as for threads (blocks 2424, 2419, 2425, 2421, 2426, 2423). In one embodiment, AW-colored clusters are treated similarly (blocks 2427, 2419, 2428, 2421, 2429, 2423).

In one embodiment of the invention, the actors associated with each initial discussion cluster may be augmented by the following procedure. For each discussion cluster (blocks 2430, 2419, 2428), identify all associated actors and, if all those actors fall into one of the actor clusters (computed in the first step), augment the set of actors recorded for the discussion cluster with members from the actor cluster (blocks 2431, 2432). For those discussion clusters that have associated actors that fall into more than one actor cluster, remove all actors who were only cc'd or bcc'd in the discussion cluster (block 2433) and repeat the test (block 2434). If the remaining group falls into one cluster, then associate the discussion with that cluster (block 2432). Associate the remaining discussions with the union of the respective clusters in which their associated actors occur (see block 2435). This approach will potentially cause discussion clusters to grow larger than they otherwise would have when going through the expansion steps below, and therefore makes the system less likely to miss elements that should have been included in a discussion. An alternate embodiment can simply start with the thread clusters and the identified actors associated with items in the cluster.

This step will identify communications that may be part of the same discussion on the basis of actors coordinating modifications on a document. For each discussion (blocks 2430, 2438, 2449), collect the set of all document items reachable from an AT-colored link originating at an item in the cluster and all document items reachable through R-colored links added during named entity extraction (block 2439). For each document item in the set (blocks 2439, 2440, 2448), identify all other document items reachable through CR-colored links (block 2441), i.e. near duplicates of the attachment that are not in a template cluster. Retrieve modification events pertaining to these documents (block 2442), including creation and deletion or removal from a repository. Scan for the modification events (blocks 2442, 2443, 2447) which both occurred within the lifespan of the discussion cluster and were committed by one of the actors associated with the discussion (block 2444). Recall that this set may have been augmented from the contents of actor clusters. Additionally, comparison of the time of the event to actor heartbeat and other similar constraints may be applied (block 2444). Add tentative discussion edges from the previous and next revision to the modification event that lies between them (block 2445). Now update the list of actors associated with this “expanded thread” to also include those identified in any of the modification events that were pulled in, or who were affiliated with the document by dint of authorship (block 2446).

The next steps examine events for inclusion. FIG. 24 d is a flowchart of one embodiment of incorporating events into discussions. These events typically provide sustenance for a discussion. An example of event sustenance might be a document modification or access event following a communication about the document, which then allows the system to find relationships with items related to the next version of that document. This step is somewhat heuristic in nature in that there are a large number of sustenance “scenarios” that must be handled on a case by case basis for different event types. The overriding principles are that the communication event must relate to an item already in the discussion cluster and relate to following items that would therefore get included in the discussion. Addition of these events is constrained by the same factors as the previous step and will update the actor set associated with each discussion cluster as above. Event processing includes but is not limited to the techniques described below.

Calendared meetings represent another important event type (block 2125, 2140), since often important communication among actors may occur at meetings, leaving no other electronic evidence trace. Calendared meetings are associated with the appropriate cluster of actors based on the actors noted in the meeting entry. If the meeting has a semantically meaningful title (i.e. something other than “meeting”) or description (see blocks 2472, 2475), the system will try to determine a topic for the meeting. In this case, the discussion clusters whose actor sets contain the meeting's actor set and match on topic and general timeline will be considered candidates for the inclusion of the meeting. However, if any of the possible candidates contain content which refers to a meeting on that date and time, the calendared meeting will be included in those discussion clusters and not any others. If the meeting has a transcript or meeting minutes associated with it (as determined by date and time and meeting attendees, see block 2467), the meeting will be attached to whichever discussion clusters contain the transcript (see blocks 2466, 2468, 2469). In one embodiment, it will also be used to determine topic (block 2473). Failing both of these, a meeting will be attached to any discussion for which it is timeframe appropriate, and which contains at least some of the set of actors from the meeting (see block 2481). Note that there is an important exception to this: a secretary or admin may be substituted for the actor she or he supports. In one embodiment, such relationships may be added by the user.

In one embodiment, calendar events 2125, such as “end of quarter”, will be inserted if there is content present referencing them (block 2476). In another embodiment, when and whether to add such events 2108 globally is user-configurable (block 2478).

Voicemail and phone events 2114 will be automatically inserted if they are between two actors 310 who are either primary actors 2406 or contributors 2404, and if they fall within the required time range (block 2471). This is because, failing any additional information, there is no way to know which of the possible discussions 305 among these actors 310 it should be linked into. Therefore, the system takes the conservative path and includes them in all such discussions 305. This behavior is helpful for investigative purposes; when a witness is being interviewed, they can perhaps clarify the purpose of the phone call. In the event that they do, the user may add a meta-document 560 containing the witness' testimony and have that document 560 be considered a related document to that event. Note that this may potentially cause an event to be removed from one or more discussions 305 and potentially added to one or more others.

Employee or actor lifecycle events 535 are added automatically by the system in order to explain or annotate an actor's presence or absence (block 2474). For example, if Actor A started off participating in a discussion and then drops out of it, a lifecycle event 545 noting that he had been transferred to another division would be automatically inserted.

All other types of events must meet the actor 310 and time test, but must also have been explicitly mentioned in the content. This is to prevent floods of events with limited or useless information from getting attached to numerous discussions 305. For example, an actor 310 may have a task list that commingles professional tasks and work items with personal ones, such as going grocery shopping. Clearly it would not be appropriate behavior for all checking off of such tasks to be appended to any discussions 305 that that actor 310 had participated in.

Next, “solo” communications that do not appear in any T-colored edges are examined (see blocks 2450, 2452, 2459). These mails are linearized in time, accounting for clock drift, tagged by the IDs of their attachments, and associated with a particular actor 310 cluster as previously described (see block 2451). The system now considers the evidence constructed during the earlier clustering stages, namely the SR, SD, SC, CS and TC edge colors. If the solo is related to one or more members of a discussion cluster via one of these edges (see blocks 2453, 2454, 2458), and it matches on actor 310 (that is, the actors 310 associated with the solo are a subset of the actors 310 currently associated with the discussion cluster), and it falls either within the time span of the discussion cluster or within an interval of I of it, where I is the length of time from the occurrence of the first element to the last (see block 2455), create a discussion edge to the solo (see blocks 2460, 2461, 2462). If there is an overlap in actors 310 between the solo and the expanded thread, but not complete containment, the solo will be placed in a holding set for later possible “best fit” placement (see blocks 2456, 2457). Additionally, a solo may become the new root for a discussion cluster (see block 2460), in which case the prior root is removed and all attributes are transferred to the new root vertex (see block 2462).
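The solo placement test can be summarized in a short decision function. This sketch assumes numeric timestamps and assumes that the holding set is also subject to the time test, which the text does not state explicitly; the function name and return labels are hypothetical.

```python
def place_solo(solo_actors, solo_time, cluster_actors, first, last, linked):
    """Decide what to do with a solo communication relative to one cluster.

    solo_actors, cluster_actors: sets of actor ids.
    solo_time, first, last: numeric timestamps (for example epoch seconds).
    linked: True if an SR, SD, SC, CS or TC edge ties the solo to the cluster.
    Returns "attach", "hold" or "skip" (labels are illustrative).
    """
    if not linked:
        return "skip"
    interval = last - first  # I: time from the first element to the last
    if not (first - interval <= solo_time <= last + interval):
        return "skip"
    if solo_actors <= cluster_actors:
        return "attach"      # actors fully contained: create a discussion edge
    if solo_actors & cluster_actors:
        return "hold"        # partial overlap: keep for later best-fit placement
    return "skip"
```

A caller would evaluate each solo against every candidate cluster and collect the "hold" results for the later re-evaluation pass.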

Next, the discussion 305 clusters themselves are clustered. In an alternate embodiment, only the communication thread clusters are merged, resulting in less “spread” of the clusters. Clustering proceeds on the SR, SD, SC, CS, TC edge set in addition to the actor set associated with each discussion/thread cluster (see block 2501). This is done in order to determine which clusters should be merged in the same discussion 305 (see blocks 2502, 2503, 2504, 2508, 2509). Discussion/thread clusters which end up in the same cluster are possible candidates for unification. Discussion 305 clusters are unified by adding discussion edges corresponding to SR, SD, SC, CS and TC edges between members of the respective clusters (see blocks 2514, 2515, 2518), from the vertex which occurs first to the other (see blocks 2516, 2517). These links are only created if the time span is consistent for that pair (according to actor heartbeat, for example). This is determined as follows:

-   If discussion A and discussion B do not overlap in time, but one ends within a distance of I of the other, where I is the max time span(A, B), they are unified (see blocks 2505, 2506). (Note that some embodiments may use other means of determining the value for I; for example, I could be the mean time between messages in the thread multiplied by a constant.)
-   If discussion A and discussion B occur further apart than I from one another, they will not be unified unless there is no other thread in the same cluster (see block 2507). This is because communications on a topic 315 that is incredibly rare may very well be related no matter how far apart in time they occur.
-   If Thread A and Thread B do overlap, or are concurrent in time, they will be united only if the actor 310 involvement is substantially the same between the two (see block 2513). In one embodiment, this is determined as follows: the actors 310 are ranked by how many emails they sent, and how many they received. The former is scored at twice the weight of the latter. If any actor 310 receives a zero in either thread, the two threads cannot be joined. This is because the absence of a particular actor 310 may be the very reason for the existence of the second thread, and therefore the system would be ignoring potentially important sociological context if it were to combine the two threads. In some embodiments, the ordinal ranking of the actors 310 between the two threads cannot differ by more than one position in order for the threads to be joined. In other embodiments, this is simply a requirement that the actor 310 presence models of the two threads be identical.
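The three rules above can be combined into one predicate, sketched below. Several points are assumptions: threads are represented as dictionaries with numeric `start` and `end` times and per-actor `sent` and `received` counts, I is taken to be the larger of the two time spans, and ties in score are broken alphabetically so that the ordinal ranking is deterministic.

```python
def _scores(thread, actors):
    """Score each actor: emails sent count twice as much as emails received."""
    return {a: 2 * thread["sent"].get(a, 0) + thread["received"].get(a, 0)
            for a in actors}


def _ranks(scores):
    """Ordinal position of each actor, highest score first (ties by name)."""
    ordered = sorted(scores, key=lambda a: (-scores[a], a))
    return {a: i for i, a in enumerate(ordered)}


def can_unify(a, b, other_threads_in_cluster=True, max_rank_shift=1):
    """Apply the time test, then the actor-involvement test, to two threads."""
    overlap = a["start"] <= b["end"] and b["start"] <= a["end"]
    if not overlap:
        gap = max(a["start"], b["start"]) - min(a["end"], b["end"])
        limit = max(a["end"] - a["start"], b["end"] - b["start"])  # I
        # Far-apart threads are unified only when no other thread competes.
        return gap <= limit or not other_threads_in_cluster
    actors = (set(a["sent"]) | set(a["received"])
              | set(b["sent"]) | set(b["received"]))
    sa, sb = _scores(a, actors), _scores(b, actors)
    if any(sa[x] == 0 or sb[x] == 0 for x in actors):
        return False  # an actor absent from one thread blocks the join
    ra, rb = _ranks(sa), _ranks(sb)
    return all(abs(ra[x] - rb[x]) <= max_rank_shift for x in actors)
```

An embodiment requiring identical presence models would replace the final ranking comparison with an equality check on the two models.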

As the next-to-last step, the proper time frame for the discussion is recalculated based on the date and timestamps of the items that are currently first and last chronologically (see block 2510). As a result of this possible lifespan change, additional items may be added to the discussion 305. In particular, any solos being held in a possible fit category will be re-evaluated (see block 2511).

At this point, these “proto-discussions” are really candidate discussions 305 awaiting a sanity check or refutation test pass. Discussion 305 links will be finalized in pass 2 (see block 2512). (See the subsequent section entitled “Finalization of Discussions.”)

In another embodiment of the invention, discussions 305 are built via a series of constrained item clusterings, where the valid item types are: communication documents 510, regular documents 515, and all kinds of internal events (see block 515, 510, 2102 in FIG. 25 b). During each phase the clusters obtained are decorated by the elements that describe a discussion 305. The decorated top-level clusters are the discussions 305. In the next clustering phase each cluster is considered as a single point and is linked further with other clusters. Conceptually this is a form of hierarchical clustering, but each level of the hierarchy is built with differently tuned algorithms, and the larger clusters that a level produces have a particular interpretation and are decorated with different information. (In some cases the clusters created at a lower level may be reconsidered.) Specifically, this is implemented as follows:

-   The base set of vertices is the set of all items. (Recall that exact duplicates have already been eliminated by the initial indexing process, but all meta-data information for the duplicates, such as location of occurrence and timestamp, is retained.)
-   The first “clustering” phase considers only communication documents 510. In this initial pass, each reply chain or “thread” of emails is clustered together (see block 2521). This phase recreates the “reply-to” and “forward” relationships between emails with T-colored links. T-colored vertices are created to track metadata used during the clustering procedure.
-   The next clustering phase clusters together near-duplicate regular documents 515 (see block 2522), as was previously described. In the process, the sets of metadata for the exact-duplicate instances are merged into a set of metadata for each near-duplicate instance. Note that this clustering is not applied to communication documents 510 at this point in the process because such documents are versionless.
-   The next clustering phase covers document 505 metadata matches, such as by title or subject, and workflow 2409 (see block 2523).
-   The next clustering phase considers all attachments, and other documents 505 linked to communication documents 510, including those incorporated by reference or hypertext link (see block 2524). For the following description, all such documents will be referred to as “attachments.” For each attachment, the system constructs a ranked list of documents 505 that are proximate to the attachment in the derivation history tree produced as a result of the previous process. For communications 510 that “score high” (are compatible) on all counts (derivation history closeness and shared attributes): create a cluster link between the attachment and the communication 510. The link extends from the attachment to its communication 510. If the second communication 510 is also an attachment, then the link also extends to this item. In this case the effect is to merge into a single discussion 305 two communication threads that exchange closely related versions of an attachment. Mark the resulting cluster (tentative discussion 305) with the attachments used to compose it.
-   Apply the same technique as in the previous step, replacing the attachment with the communication 510 body, for only those items which share more than 2 textblocks with a regular document 515 (see block 2525). This step is performed in order to catch the cases where a draft document has been typed directly in the communication 510 rather than in a text processor. This also catches the cases where an email agent or similar mechanism has converted an attachment to inline content.
-   Then perform the event integration as described in the previous embodiment, and as noted in FIG. 24 d (see block 2526).
-   In one embodiment, the following step is performed involving negative evidence and link-breaking: for each tentative discussion 305 apply a fast bisection over its items (see block 2527). (This is a step usually taken in a bisecting top-down clustering.) The bisection dimensions take into account text features but also time and actors 310. Consider the two resulting halves of the discussion 305. If the computed halves are not compatible with the discussion 305 topology (for example, are randomly dispersed in an email thread) then stop: the bisection negative test is inconclusive. Compute a dissimilarity measure between the two halves (for example, mutual information). If the dissimilarity is small then stop: the bisection negative test is inconclusive. Otherwise the test may be a sign of topic drift within a sustained discussion 305 among actors 310, or it may be a sign of a mistaken choice of a link during clustering. To solve both problems, cut the discussion 305 into 2 pieces along a minimum cut of the cluster edges (as determined during the topology compatibility test). Re-pass the resulting smaller discussions 305 through the negative evidence test so they may be broken down further if appropriate (see block 2528).

Phase 2: Finalization of Discussions

Following the accrual of evidence via the edges added to the colored graph in the first phase, D-colored vertices and edges are added to the graph to represent the final form of the discussion 305. The primary techniques that are used in this phase are the construction of the actor presence model 2654 and construction of a Markov chain of pragmatic tags 2301.

This phase has several purposes. Since much of the evidence accrual to this point has been based on clustering and pairwise relationships between items, an extra pass is required to filter out items that do not match with the history and evolution of accumulated discussion 305 properties. This procedure also resolves any remaining ambiguities as to where individual items should be placed in the discussion 305. The evolution of the actor presence model over time is used to determine where discussions 305 end, split or merge.

The procedure first evaluates a set of candidate items, as well as which items could be considered their direct parent in the discussion 305, recording the results for each. It then walks back through this record and creates discussion 305 edges. It uses two secondary data structures, a chart (see block 2588) and a lookup table (see block 2589) indexed by item id. Chart data structures are standard practice in a large class of parsing algorithms. Briefly, a chart is a sequential list of vertices to which edges (see block 2590) are added, which are decorated with data recording the result of a computation involving all vertices between the head and tail of the edge. For the purposes of this procedure, it can be implemented as an array with one entry per vertex. Each entry contains another array which contains all edges that terminate at that vertex. An edge consists of a record containing the head and tail vertex indices and any other data necessary to record prior results. The chart is supplemented with a lookup table that associates an item id with the index of the chart vertex that was current when that item was evaluated. Chart edges keep track of a candidate vertex (see block 2595), the proposed parent of that vertex (see block 2596), a proposed structural operation which determines whether the vertex will be added to the discussion or not (see block 2593), a cut flag (see block 2597) which is used for bookkeeping, a parent edge (see block 2598) by which a sequence of operations leading back to the beginning of the chart can be traced, and the score (see block 2594) for that path. In addition, at each edge the current overall actor presence model is kept, and a separate actor presence model is kept for considering only the parents of the end item on the edge. The pair will be compared in order to look for discussion 305 splitting points. The models can often be shared across multiple arcs as seen below, and other schemes such as only recording changes can be used to reduce memory consumption if necessary.

The procedure uses a fixed size window to delay decisions until the downstream implications of those decisions can be taken into account. Because the window has a fixed size, the chart could actually be implemented as a FIFO queue or a circular array. The size of the fixed window can be determined using parameter estimation techniques.
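Purely as an illustration, the chart, its lookup table and the fixed window described above might be sketched as follows. The names `ChartEdge`, `Chart` and `walk_back` are hypothetical and do not appear in the specification, and using a bounded `deque` as the FIFO window is an assumption of this sketch; presence models are omitted for brevity.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChartEdge:
    head: int                     # head vertex index of the edge
    tail: int                     # tail vertex index of the edge
    candidate: str                # item id of the candidate vertex
    parent_item: Optional[str]    # item id of the proposed parent
    operation: str                # proposed structural operation
    score: float                  # score for the path ending at this edge
    cut_flag: bool = False        # bookkeeping flag
    parent_edge: Optional["ChartEdge"] = None  # link back toward chart start


class Chart:
    def __init__(self, window_size: int):
        # One list of edges per chart vertex; the bounded deque acts as a
        # FIFO window, silently dropping the oldest vertex when full.
        self.vertices = deque(maxlen=window_size)
        self.lookup = {}          # item id -> index of chart vertex
        self.next_index = 0

    def add_vertex(self, item_id: str) -> int:
        self.vertices.append([])
        self.lookup[item_id] = self.next_index
        self.next_index += 1
        return self.lookup[item_id]

    def add_edge(self, edge: ChartEdge) -> None:
        # Edges are stored on the vertex at which they terminate.
        self.vertices[-1].append(edge)

    def best_edge(self) -> ChartEdge:
        # Highest scoring edge on the current (most recent) vertex.
        return max(self.vertices[-1], key=lambda e: e.score)


def walk_back(edge: ChartEdge, steps: int) -> ChartEdge:
    # Follow parent edges (not item parents) for at most `steps` hops.
    for _ in range(steps):
        if edge.parent_edge is None:
            break
        edge = edge.parent_edge
    return edge
```

In this sketch a decision about an item is effectively deferred until the window has moved past it, which is the role the fixed window plays in the procedure that follows.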

The procedure is as follows:

-   Start with a schedule of item vertices to be examined (see block 2532). For a non-incremental process the schedule can be initialized by scanning the colored graph looking for D-colored vertices. In an incrementally updated graph the process is more complicated. Vertices that were left unresolved from the last pass are added to the schedule if new vertices are found to be related to them. In addition, all D-colored vertices marked as tentative (i.e. they were added during the last incremental update) are added to the schedule.
-   Remove the next item vertex from the schedule (see block 2533); this becomes the root item vertex.
-   Initialize the chart and the chart lookup table, and set the chart vertex id to zero (see block 2534). If the vertex is a discussion 305 vertex (see block 2535) then add an initial arc, proposing an ‘add’ structure operation (see below for the role of the proposed structure operations), and set the chart vertex index to 1 (see block 2537). Otherwise, initialize the lookup table with member vertices of the discussion (see block 2536). Each member should be placed in the table as its own value, i.e. if a vertex A is the key, A is the value retrieved for the key.
-   Construct a list in date/time order of vertices that are reachable via tentative discussion edges (see block 2538). This list should be constructed incrementally during the traversal rather than ahead of time.
-   For each item added to the chart, increment the chart vertex (see blocks 2556, 2557).
-   For each tentative member vertex, first check to see if it has a parent in the lookup table; if it does not, then skip it (see blocks 2539, 2540, 2551).
-   Determine the change in the author's status in the overall presence model (see block 2541). Keep track of this change as it will also be applied to the models kept on each arc.
-   For each proposed parent (see blocks 2542, 2543, 2550) propose a set of edges which correspond to the following proposed structure operators (see blocks 2544, 2545, 2546):
    -   add item to discussion 305
    -   split item from discussion 305
    -   cut item from discussion 305
    -   if the vertex is already a finalized member of another discussion, merge it with the current discussion
-   For each proposed edge (see blocks 2547, 2549) determine a likelihood score (see block 2548). This score is determined by comparing the state of the discussion as recorded on a prior arc in the chart with the current state of the discussion. This score is calculated from three main types of evidence.
    -   1) The likelihood that the detected change in the actor presence model is consistent with the structural operator proposed for the current edge (see blocks 2561, 2567). In other words, is the set of actors 310 involved with the current vertex consistent with branching off to a new discussion 305, or continuing the current one, or merging with another discussion 305? To calculate this, the prior actor presence models, both the overall and the branch presence models, are required (see blocks 2591, 2592). The score is obtained by mapping the previous and current actor presence models in combination with the structural operation to states in a Markov model. This procedure is very similar to that described for pragmatic tagging. Multiply the prior arc's score by a scaling factor, then by the transition probability trained into the model for that pair of states.
    -   2) The likelihood that the proposed structural operation is consistent with prior operations proposed in the process of building up the discussion (see blocks 2562, 2568). Some embodiments may choose to incorporate other attributes of the discussion into the training of the second Markov model.
    -   3) The likelihood that the pragmatic tag set of the current vertex item follows from the tag set of the proposed parent (see blocks 2563, 2569). The score can be determined by finding the maximum scoring pair of tags out of the parent and child tag sets (recall that an item may have more than one textblock, and each textblock has one pragmatic tag).
-   The chart actually stores several sequences of modifications to a discussion 305 and the various states of the discussion 305 that would be obtained by making those modifications. The overall purpose of the algorithm is to find the most likely of those sequences. This is obtained by iterating through the list of chart edges incident on the preceding chart vertex (see blocks 2559, 2560, 2565, 2566, 2572), and selecting the highest scoring of those edges as the parent chart edge for the current proposed edge (see blocks 2564, 2570, 2571).
-   When the highest scoring parent edge is found, the new actor presence models are calculated and stored in the current edge (see blocks 2564, 2571, 2578). This is followed by some final bookkeeping (see blocks 2577, 2576, 2575, 2574). (Note: Implementations need not store an entire presence model; for example, one embodiment would store only the changes in the actor 310 model at each step. If the presence model to state mapping function described below is used, then the actual presence model never needs to be stored; the prior state and the changes recorded at each step are sufficient to calculate the next state.) However, there is a twist in that the models are updated differently depending on the structural operation proposed for the current edge, as follows:
    -   1) If the proposed operation is add, merge or split, then apply the actor status change calculated earlier, and update the presence models from the winning parent vertex edge with actors associated to the current item (see blocks 2564, 2571).
    -   2) If the proposed operation is cut, then create a new overall and branch presence model which is initialized with actors associated to the current item (see blocks 2564, 2571).
-   After completing all arcs for the current vertex, the discussion is updated (see block 2552). Select the highest scoring chart edge on the current chart vertex (see blocks 2579, 2580) and walk backwards through parent arcs (note: NOT the vertices where item parents are stored) for the window size (see block 2581). If the chart edge so obtained has the cut flag set (see blocks 2574, 2582), delete the chart vertex, move the following vertices down by 1, update indices in the chart lookup table and reduce the current vertex id by 1 (see block 2585). (In other words, skip over descendants of cut or split items until the window fills with items to be added to the discussion.) Then apply the proposed structural operation on the chart edge to the current discussion:
    -   1) For add and merge operations, if the parent is a member of the discussion 305, create a discussion 305 edge (see blocks 2583, 2586). (If the parent is not a member, then this item descends from an item that was cut.) Store the discussion 305 id, the overall presence model, the score and other necessary data on the discussion 305 edge.
    -   2) For cut and split operations, add the item vertex to the schedule mentioned in the first step (see blocks 2583, 2587).
-   Termination (see block 2558) is handled differently for incremental and non-incremental processes. For a non-incremental process (see blocks 2553, 2555, 2556), select the highest scoring edge on the last vertex, walk back window size-1 steps, and apply structural operations (e.g. cut, etc.) walking forward along the same path. For an incremental process, save unprocessed items so as to initialize the next incremental pass as described in the first step (see blocks 2553, 2554, 2556).
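The parent-edge selection step of the procedure above can be sketched, under stated assumptions, as follows. The function names and the dictionary-based transition table are hypothetical. The sketch also assumes that candidates are ranked by their extended score (prior score times scaling factor times transition probability), which is one reading of the procedure rather than a confirmed detail.

```python
def extend_score(prior_score, scale, transition_prob):
    # Multiply the prior arc's score by a scaling factor, then by the
    # transition probability for the (previous state, proposed state) pair.
    return prior_score * scale * transition_prob


def pick_parent_edge(prior_edges, proposed_state, transitions, scale=1.0):
    """prior_edges: list of (state, score) for edges on the preceding vertex.
    transitions: dict mapping (previous_state, next_state) -> probability.
    Returns (index of the chosen prior edge, resulting score)."""
    best_index, best_score = None, float("-inf")
    for i, (state, score) in enumerate(prior_edges):
        # Unseen transitions are treated as probability 0 (an assumption).
        prob = transitions.get((state, proposed_state), 0.0)
        new_score = extend_score(score, scale, prob)
        if new_score > best_score:
            best_index, best_score = i, new_score
    return best_index, best_score
```

For example, with prior edges `[("add", 0.5), ("cut", 0.4)]` and transition probabilities of 0.2 for add-to-split and 0.5 for cut-to-split, the second edge would be chosen as parent for a proposed split.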

The embodiment listed here uses supervised methods to calculate likelihood scores for the general procedure above. There are three areas that require likelihood estimation. First, that the changes in a presence model are consistent with a proposed structure operation, given the prior presence model(s). Second, that a pragmatic tag is consistent with prior tag(s), given a proposed structure operation. Third, that a proposed structure operation is consistent with other general attributes of the discussion.

The preferred embodiment uses Markov models, similarly to the techniques used for assigning pragmatic tags. The state space for each model is based on a combination of structural operations and other attributes of the system. Changes in an actor presence model will be scored by the number of actors added, dropped or moved, multiplied by a weight factor based on whether the actor 310 is moved into or out of the primary actor 2406, contributor 2404, participant 2403 or observer 2405 categories. Pragmatic tags are well defined, and the state space will simply be the cross product of structural operations and each possible pragmatic tag. The third category can define a state space based on quantizing different evidence types, the exact choice dependent on the first phase operation; for the embodiment here, whether the elapsed time from the last communication in the discussion falls into an extreme trough, an extreme burst, or in between (previously described). Particularly for discussion 305 building, n-order models are preferred, where n is no larger than the window size defined for the above procedure. Given the somewhat open-ended nature of the first definition, either a sparse matrix should be used, as discussed for pragmatic tagging training, or the score should be bounded by a reasonable value (about 50% of a large discussion actor set).
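As a rough sketch of the first two state spaces, the presence-change score and the tag state space might be computed as below. The weight values, the level names used as dictionary keys, and the rule of weighting each change by the highest category entered or left are illustrative assumptions, not values given in the specification.

```python
from itertools import product

OPERATIONS = ("add", "split", "cut", "merge")

# Hypothetical weights per presence category; real values would be trained.
LEVEL_WEIGHT = {
    "primary": 4.0,
    "contributor": 3.0,
    "participant": 2.0,
    "observer": 1.0,
}


def presence_change_score(prior, current, cap):
    """prior/current map actor id -> presence level (absent = not present).
    Each added, dropped or moved actor contributes a weight; the total is
    bounded by `cap` to keep the state space finite."""
    score = 0.0
    for actor in set(prior) | set(current):
        old, new = prior.get(actor), current.get(actor)
        if old == new:
            continue
        # Weight by the highest category the actor moves into or out of.
        score += max(LEVEL_WEIGHT.get(old, 0.0), LEVEL_WEIGHT.get(new, 0.0))
    return min(score, cap)


def tag_state_space(pragmatic_tags):
    # Cross product of structural operations and pragmatic tags.
    return list(product(OPERATIONS, pragmatic_tags))
```

Bounding the score with `cap` corresponds to the second option mentioned above; a sparse transition matrix would be the alternative.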

The models are trained by running the system through the first phase of discussion 305 building, then extracting an appropriate sequence of labels from discussion 305 clusters and manually labeling transitions with appropriate structural operations, taking care to include cut and split nodes. Internal parameters and thresholds can be set by parameter estimation (i.e. evaluating accuracy over multiple runs with different parameter settings). The training procedure is more complicated than the procedure defined earlier in that it creates and updates an internal presence model in the same manner as the phase 2 procedure above in order to use the state spaces defined above.

The update procedure for actor status of the author of an item is based on the tallies and evidence types listed under the definition of the actor 310 presence model. Tallies are kept for:

-   pragmatic tags (R colored vertex metadata)
-   posts that fall in communication time burst clusters (GB and PB colored links)
-   frequency of participation (D colored vertex metadata)

Thresholds for these tallies can be determined from the same type of training data mentioned above by standard techniques. Pragmatic tags provide some of the strongest evidence for level of presence in a discussion. Actors 310 with higher counts of initiator/clarifier/summarizer tags tend to be contributor 2404 or primary actors 2406.

The following evidence types can also trigger a modification of presence level:

-   job title/role and reporting relationships (A colored vertex metadata)
-   membership in actor cliques or “circles of trust 1505” (CL colored links)
-   membership in a collaboration cluster (CO colored links)

Discussions 305 containing a manager and their direct reports automatically assign a higher participation level to the superior; in the embodiment here these actors are moved one level higher than they would normally be considered. For discussions 305 centered around a clique or collaboration cluster, all members are assumed to be at least participants 2403 because they likely use other communication channels in addition to what can be tracked in the corpus. However, there is also a higher requirement for becoming a primary actor 2406 in such a discussion. So the effect on these types of discussions is a tendency towards contributor 2404 and participant 2403 levels.

In one embodiment of the invention, a machine learning mechanism including, but not limited to, a neural network or genetic algorithm is used to optimize weights in the system.

Refine Actor Graph

Once the discussion 305 building process is complete, the system takes a final pass at correcting the actor 310 graph. In one embodiment of the invention, it does this by evaluating the actor 310 presence shifts within all discussions 305, and seeking patterns in which the presence of different actors seems to sequentially alternate. An example is the case in which Actor D is involved in the first several items of a discussion 305, then disappears for the next several items, while Actor E appears in his place for a while. Similarly, such alternation may correspond to particular topic(s) rather than an interval of time. Each occurrence of such an alternating pattern is recorded. In the event that any pair of actor 310 identities occurs with a statistically significant frequency, it is reasonably likely that the two (here, Actor D and Actor E) are in fact the same person, but prior to the evidence offered by the discussions 305, there was no way for the system to draw this conclusion. In one embodiment, such suspected unmerged actor identities are presented to the user for acceptance or rejection. In others, the merging is done automatically. In either case, the two previously distinct identities will be treated as two personalities 1220 of the same actor 310. If the merge is automatic, the personality 1220 that corresponded to the anonymous identity will be considered the secondary one. In some embodiments, this personality 1220 will be flagged as having been inferential. Note that this will in turn cause circle of trust 1505 information, as well as any other actor 310 graph dependent data, to be recalculated.
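One way to record alternating-presence patterns is sketched below. The function name, the input shape and the specific test used (two actors who never appear on the same item but hand off at least once within a discussion) are assumptions made for illustration; the statistical significance test on the resulting counts is left to the caller.

```python
from collections import Counter
from itertools import combinations


def alternation_counts(discussions):
    """discussions: list of discussions, each a chronological list of sets
    holding the actor ids present on each item.
    Returns a Counter mapping (actor_a, actor_b) -> number of discussions
    in which the two actors alternate without ever co-occurring."""
    counts = Counter()
    for items in discussions:
        actors = set().union(*items)
        for a, b in combinations(sorted(actors), 2):
            # Actors seen together on one item cannot be the same person.
            if any(a in present and b in present for present in items):
                continue
            # Sequence of which of the two was present, item by item.
            seq = [a if a in present else b
                   for present in items if a in present or b in present]
            switches = sum(1 for x, y in zip(seq, seq[1:]) if x != y)
            if switches:
                counts[(a, b)] += 1
    return counts
```

Pairs whose count is unusually high relative to their overall activity would then be candidates for presentation to the user or for automatic merging.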

“Related” Discussions

In one embodiment, Discussion A and Discussion B will be considered related to one another if any of the following are true:

-   B is an offshoot of a communication event 570 in A. For example, a subset of actors initiating their own discussion in the midst of another discussion.
-   B is an offspring of A; a message “borrowed” from A becomes part of one of the elements of an otherwise separate discussion B (with a different actor 310 presence).
-   If A and B merge into one discussion 305 C, then A, B, and C are all considered to be related.

One embodiment of the invention also defines other kinds of relations which have a semantic dimension. Such relations include, but are not limited to, the following:

-   All discussions 305 that match on topic 315 which occurred proximate to a particular time, or within a given time interval
-   All discussions 305 on a particular topic 315 within a particular circle of trust 1505
-   All discussions 305 that match an ad hoc workflow 2409 among the same cluster of actors 310

In addition, in one embodiment, the system will consider discussions that have highly similar characteristics to be related. Such similarity can be determined by hierarchical clustering or similar techniques.

Ad Hoc Workflow

Not every corpus will have a formal workflow management system associated with it. Even if one is present, it is highly unlikely that it in fact captures all workflow processes, both formal and informal. However, much of the day to day work at corporations is performed through often repeated informal or ad hoc workflow processes. The presence of ad hoc workflows 2409 is suggested by repeated patterns of any of the following:

-   The appearance of two or more templated documents 575 in the same relative order (Block 2602).
-   Repeated sequences (A→B→C→A) of communications or actions among individual actors 1210 A, B, C, members of the same actor groups (for example, actors 310 in the same department), or actors with the same job titles. In one embodiment, this test may have topic 315 and pragmatic tag limitations.

These patterns are then tested for statistical significance (Block 2612, 2614, 2616, 2618). Detecting ad hoc workflow 2409 (see blocks 2602-2618) is important because it allows the system to detect missing items. One obvious example of an ad hoc workflow 2409 is the pair of events of a purchase request being made and then granted. If there is a document 505 that asserts the request was granted, there must at some point have been a document 505 that initially made the request. In addition, once such ad hoc patterns have been determined, they can be queried on. Further, in one embodiment, including incremental or continuous updating versions of the system, prior existence of specific ad hoc workflows 2409 is used as a hint during discussion building.
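A minimal sketch of finding repeated actor sequences such as A→B→C→A follows. The function name and input shape are hypothetical, and a simple minimum-count threshold stands in for the statistical significance test, which the specification does not detail here.

```python
from collections import Counter


def repeated_sequences(actor_sequences, length, min_count):
    """actor_sequences: list of chronological lists of actor ids (or job
    titles, or group labels), one list per candidate discussion.
    Returns {pattern: count} for contiguous patterns of the given length
    seen in at least `min_count` different sequences."""
    counts = Counter()
    for seq in actor_sequences:
        seen = set()
        for i in range(len(seq) - length + 1):
            pattern = tuple(seq[i:i + length])
            if pattern not in seen:   # count once per sequence
                seen.add(pattern)
                counts[pattern] += 1
    return {p: c for p, c in counts.items() if c >= min_count}
```

Substituting job titles or department names for actor ids would give the group-level variants of the same test.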

Summary

All discussions 305 have an automatically generated summary 2410. In one embodiment of the invention, this summary 2410 is of the following form:

“A discussion led by <Primary Actors> 2406, with active participation from <Contributors> 2404 and others, over the period of time spanning <Discussion Lifespan.> The primary topics discussed were <discussion topics>. If no topics have been identified, “about <named entities>” will be used instead. In the course of the discussion 305, the following documents were modified <document names> by <actors.>” (Blocks 2705, 2710, 2715, 2720, 2725).

This last sentence only appears when more than one document 505 was modified. If only one document 505 was modified, the sentence is rewritten as “In the course of the discussion 305, the <document type> <document name> was modified by <actors.>”

If the discussion 305 contains a workflow process, a sentence will be added to the summary 2410 indicating this: “The <workflow process type> <workflow process instance> was <initiated|completed|terminated.>” (Note that in order for the system to recognize that a workflow 2409 had been terminated, it would require a template for the workflow process in question.) (Block 2730)
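The template filling described above could be sketched as follows. The function name and argument shapes are hypothetical, and the wording used when no topics are available ("The discussion was about ...") is one interpretation of the “about <named entities>” substitution, not wording confirmed by the specification.

```python
def build_summary(primary, contributors, lifespan, topics, entities, modified):
    """modified: list of (document_type, document_name, actor) tuples."""
    parts = [
        f"A discussion led by {', '.join(primary)}, with active "
        f"participation from {', '.join(contributors)} and others, over "
        f"the period of time spanning {lifespan}."
    ]
    if topics:
        parts.append(f"The primary topics discussed were {', '.join(topics)}.")
    elif entities:
        # Fallback when no topics were identified (assumed phrasing).
        parts.append(f"The discussion was about {', '.join(entities)}.")
    actors = ", ".join(sorted({actor for _, _, actor in modified}))
    if len(modified) > 1:
        names = ", ".join(name for _, name, _ in modified)
        parts.append(
            "In the course of the discussion, the following documents "
            f"were modified: {names} by {actors}."
        )
    elif len(modified) == 1:
        doc_type, name, _ = modified[0]
        parts.append(
            f"In the course of the discussion, the {doc_type} {name} "
            f"was modified by {actors}."
        )
    return " ".join(parts)
```

A workflow sentence, where applicable, would be appended in the same way.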

In another embodiment of the invention, participants 2403 and observers 2405 are also named explicitly instead of just being specified as “others.” In still other embodiments of the invention, using standard lexical summarization techniques, the most representative sentences authored by each of the primary actors 2406 and contributors 2404 are also included in the summary 2410, in the chronological order in which they appeared in the discussion 305. If pragmatic tagging has identified agreement or disagreement between actors 310 in relation to a particular assertion (Block 2735), this information may optionally be included as well: “<Actor> asserted <assertion>; <actors> <agreed|disagreed.>” An assertion in this case is the final sentence prior to the lexical markers indicating agreement or disagreement. In yet another embodiment, even if there are no such markers present to indicate agreement or disagreement, the presence of lexically dissimilar responses to a particular assertion will be considered a divergence of opinion, and each such distinct response will be listed along with the actor 310 who provided it: “<Actor> asserted <assertion>; <actor_1> responded <response>, <actor_2> responded <response>.” Where responses are lexically similar, or explicitly state agreement, they may be collapsed: “and <actors> agreed.” Actor 310 responses are listed in the chronological order in which they occurred in the discussion 305, since one actor 310 response may influence another.

In one embodiment, the user may modify the template provided by the system. If a discussion 305 is revised as the result of the addition of new data to the corpus, the summary will be automatically regenerated.

Resolution

Whereas the intent of a summary 2410 is to summarize the discussion 305 from beginning to end, the resolution 2411 focuses on the outcome of the discussion 305. The term “resolution” 2411 has two related uses in the system: 1) the outcome-oriented summary of the discussion 305 automatically generated by the system, and 2) the actual content in the discussion 305 that contains the resolution. Note that 2) is not necessarily the final item of a discussion 305, though it will generally be towards the end of a discussion 305. Further, a resolution 2411 may sometimes only be inferable, rather than provable, based on the electronic evidence alone. For example, the resolution 2411 to a discussion 305 may occur in a communication event 570, such as a conference call.

In order for the system to produce 1), it must first locate 2), if it exists. In one embodiment, this is done according to the following procedure:

-   In the event that the discussion 305 contains a workflow process (Block 2802), the outcome of this workflow process is considered to be the resolution (Block 2804) of the discussion 305, and an appropriate resolution template is generated (Block 2828).
-   If this is not the case, the system applies the following heuristic, walking backwards from the tail of the discussion 305:
    -   Locate the first communication document 510 that is not pragmatically tagged as an acknowledgment or other “low content” communication (Block 2806).
    -   Locate the first communication document 510 from the organizationally highest ranking actor 310 in the discussion 305. If there is no single highest ranked actor 310, disregard (Block 2808).
    -   Locate the first communication event 570 among the primary actors 2406 and contributors 2404 (Block 2810) in the discussion 305.
    -   Similarly, locate the first communication event 570 among these actors 310 and their managers (assuming that this data is available) (Block 2812).
    -   Whichever of these is found closest to the tail (or at it) will be considered the resolution 2411. If the only such item is at the head, disregard it.
-   If none of these items are found to exist (Block 2814), again walking backwards from the tail of the discussion 305, locate a communication document 510 containing an attachment (Block 2816) or document link that was distributed among the primary actors 2406 and contributors 2404. Disregard the item if it is at the head of the discussion 305 (Block 2818).
-   Similarly, failing this, the system will attempt to locate a document 505 that contains lexical markers such as “resolution” or “answer” (Block 2820).
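The backward-walking heuristic above can be sketched in simplified form as follows. The item fields (`kind`, `author`, `tag`, `has_attachment`, `text`), the tag name `acknowledgment` and the function name are hypothetical. The workflow branch and the manager-related events are omitted to keep the sketch short.

```python
LOW_CONTENT_TAGS = {"acknowledgment"}
MARKERS = ("resolution", "answer")


def locate_resolution(items, top_actor=None):
    """items: chronological list of dicts with keys 'kind' ('document' or
    'event'), 'author', 'tag', 'has_attachment' and 'text'.
    top_actor: the single highest ranking actor, or None if there is none.
    Returns the index of the resolution item, or None."""

    def last_index(pred):
        # Walk backwards from the tail; index 0 (the head) is never used.
        for i in range(len(items) - 1, 0, -1):
            if pred(items[i]):
                return i
        return None

    candidates = [
        last_index(lambda it: it["kind"] == "document"
                   and it["tag"] not in LOW_CONTENT_TAGS),
        last_index(lambda it: it["kind"] == "document"
                   and top_actor is not None and it["author"] == top_actor),
        last_index(lambda it: it["kind"] == "event"),
    ]
    found = [i for i in candidates if i is not None]
    if found:
        return max(found)          # whichever lies closest to the tail
    i = last_index(lambda it: it["kind"] == "document" and it["has_attachment"])
    if i is not None:
        return i
    return last_index(lambda it: any(m in it["text"].lower() for m in MARKERS))
```

A return value of `None` corresponds to a discussion with no located resolution.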

Finally, if none of these items can be found in the discussion 305, apart from its head, the discussion 305 is determined to have no resolution 2411. This is a valid state; ways that it can occur include the following:

-   The discussion 305 was resolved informally (i.e. around the proverbial water cooler).
-   The discussion 305 never achieved resolution; the participants 2403 abandoned it.
-   The discussion 305 did achieve resolution, but outside the time frame which is being considered (Block 2826).
-   Similarly, the discussion 305 did achieve resolution, but no record of the resolution still exists (Block 2830).

In the event that there is no resolution (Block 2850), in one embodiment, the automatically generated resolution 2411 will simply state “None” (Block 2862) in the resolution field. If the resolution 2411 is presumed to have occurred inside an opaque communication event 570 (such as a phone call for which there is no transcript available), the resolution field will indicate (Block 2868) in one embodiment: “The resolution may have occurred during a <communication event type> meeting between <actors> on <date time.>” If the information regarding this event 570 was extracted from an online calendar which specified a location for the meeting, one embodiment adds the following information: “at <location.>”

If the resolution 2411 has actual content associated with it, the actor 310 responsible for the greatest amount of content in the item containing the resolution 2411 will be quoted for up to a user-configurable number of sentences. This is also referred to as the “primary author” (Block 2854) (or, in the event of a transcript associated with a meeting event 2112 (Block 2852, 2856), the “primary speaker” (Block 2860), as identified by their name preceding a textblock 325). Note that this may not always be the actual author of the item. In one embodiment, this is expressed in the form: “On <date time> <actor> stated: <sentences.>” (Block 2864) Note that sentences could also be sentence fragments or phrases, depending on the actual content.

In one embodiment, the user may modify the template provided by the system. If the discussion 305 is revised due to the addition of new data to the corpus, the resolution 2411 will have to be regenerated.

In continuous versions of the system, a discussion 305 may not have a resolution 2411 for the simple reason that it has not yet concluded. A discussion is considered to be terminated when there are no further items appearing after an interval of t after the last item. In one embodiment of the invention, the value of t is calculated according to:

-   For actors 310 {a, b, . . . z} participating in the discussion 305, if f(a,b) yields the longest interval of warped time between consecutive communications between actors a and b, t = 2 * the largest value of f(a,b) for all pairwise combinations of actors participating in the discussion 305. In other embodiments, a value other than 2 may be used.
-   In another embodiment, t may be set to a fixed time interval by the user; alternately the user may add formulas, or select from a set provided by the system. For example, 5 * (mean time between contiguous events).

In the event that the discussion 305 is not yet (considered) complete, its resolution 2411 will be: “Pending.”

Discussion Partitions

Since discussions 305 may have an arbitrary number of items, ease of navigation and readability can become a significant issue. To counteract these potential difficulties, the system will attempt to partition longer discussions 305 into smaller logical segments. In one embodiment of the invention, any discussion 305 containing 30 items or more falls into this category; in other embodiments it is user-configurable.

The idea is to create partitions 2401 that have semantic meaning. Events 1525 that will trigger the generation (Block 2905) of a new partition 2401 include, but are not limited to:

-   Entry of one or more new contributors 2404 into the discussion 305.
-   Topic drift (as determined using any of the currently available analysis packages for this purpose. The new partition marker is inserted after the drift from one topic 315 to the next is complete.)
-   A burst of communication activity
-   A change in register
-   A trough of all discussion 305-related activity that is equal or greater in length to the interval of time covered by the items that would be in the newly created partition 2401 previous to it.
-   End of a workflow process
-   End of a project (as extracted from a project management system if available, or calendar).

Such a trigger will be acted upon unless it would violate the minimum partition 2401 size, which may be set by the user. Partition 2401 triggers that would produce a partition under this limit will be ignored (Block 2910). A partition 2401 ends where a new partition 2401 begins. The type of the previous marker does not influence the type selection of the subsequent one. In another embodiment of the invention, a partition 2401 marker is automatically inserted (Block 2920) after each N items, where the value of N is user-configurable.
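Assuming trigger events have already been detected and mapped to item indices, the marker placement with a minimum partition size can be sketched as below. Both function names are hypothetical, and the size is measured in items here, which is an assumption of the sketch.

```python
def partition_starts(n_items, trigger_indices, min_size):
    """Return the item indices at which partitions begin (0 is always one).
    A trigger is ignored if acting on it would close a partition holding
    fewer than `min_size` items."""
    starts = [0]
    for idx in sorted(set(trigger_indices)):
        if 0 < idx < n_items and idx - starts[-1] >= min_size:
            starts.append(idx)
    return starts


def fixed_partition_starts(n_items, n):
    # Alternative embodiment: a marker after every N items.
    return list(range(0, n_items, n))
```

For a 30-item discussion with triggers at items 3, 10, 12 and 25 and a minimum size of 5, the triggers at 3 and 12 would be ignored.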

Partitions 2401 are used by the querying engine in order to approximately identify those portions of the discussion 305 that are most relevant to the query 320. This comes into play when the topic 315 or lexical content specified by the user in a query 320 only really occurs as a passing topic in the midst of a much broader discussion 305. In such cases, the query engine 3120 will identify those partitions 2401 which contain the terms or topics 315 in question. This information is made available for use by user interface components, so that (for example) the partitions 2401 in question can be highlighted.

Pivotal Items

Often, a particular item in a discussion 305 will cause a sudden shift in one or more dimensions of a discussion 305. Such items, whether external 2104 or internal events 2102, email messages, or any other item type, are important to note because by definition they were either responsible for, or at the least correlated to, a substantial impact on actor 310 behavior in the discussion 305. They are for this reason often memorable to the participants 2403 in the discussion 305 long after the fact, an attribute which may be very helpful in certain use cases, such as depositions. If an actor 310 is consistently generating pivotal items, this can be considered a measure of their level of influence. In one embodiment of the system, this is also considered as a measure of actor 310 “importance.” Thus the system specially identifies them. In one embodiment, pivotal items 2402 must have occurred within a short time interval prior to any of the following:

-   Shift in actor 310 presence.
-   Topic 315 drift (as determined using any of the currently available analysis packages for this purpose. The new partition 2401 marker is inserted after the drift from one topic 315 to the next is complete.)
-   A burst of communication activity
-   A change in register
-   Initiation of collaboration
-   A splitting of the discussion 305

Pivotal items 2402 are identified on a purely empirical basis. FIG. 30 is a flowchart of one embodiment of identifying pivotal events. Each of the above changes in a discussion 305 suggests, but does not require, the possibility of a pivotal item 2402. In instances where such a change has occurred, any item occurring in the partition 2401 containing the start of the change, as well as the first item in the next partition 2401 (if the change spans 2 partitions 2401), are candidates to be selected as pivotal message(s) 2402. To determine which item, if any (Block 3005), is to be considered pivotal, all items in both the Nth and the N+1th partitions 2401 are analyzed (Block 3010) for common textblock 325 content. The first chronological item in the Nth partition 2401 containing the most commonly occurring (Block 3015, 3020) textblock 325 in the N+1th partition 2401 is considered to be pivotal (Block 3040). If no textblock 325 occurs more than once, the same test is applied on named entities (Block 3025, 3030) other than actors 310 (i.e. document titles, locations, organizations) and links to web pages.
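A simplified sketch of this selection test is given below. The item fields `id`, `textblocks` and `entities` and both function names are hypothetical, textblocks are assumed to be comparable by an identifier, and tie-breaking between equally common textblocks is not addressed.

```python
from collections import Counter


def _first_item_with_most_common(partition_n, partition_next, key):
    # Count how often each value occurs across items of partition N+1.
    counts = Counter(v for item in partition_next for v in item[key])
    if not counts:
        return None
    value, freq = counts.most_common(1)[0]
    if freq < 2:                   # nothing occurs more than once
        return None
    for item in partition_n:       # chronological order assumed
        if value in item[key]:
            return item["id"]
    return None


def pivotal_item(partition_n, partition_next):
    """Each partition is a chronological list of dicts with keys 'id',
    'textblocks' (set) and 'entities' (set). Returns an item id or None."""
    found = _first_item_with_most_common(partition_n, partition_next,
                                         "textblocks")
    if found is None:
        # Fall back to named entities (and links) when no textblock repeats.
        found = _first_item_with_most_common(partition_n, partition_next,
                                             "entities")
    return found
```

The result is the earliest item in the Nth partition that carries the content most often repeated in the following partition.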

For example, in the case of an external article being forwarded to several of the actors 310 in a discussion 305 prior to a burst in communication, the initial appearance of the article is the pivotal item. In the case where there is a sharp register change between consecutive communications, which of the two communications is really the pivotal event is determined by the combined number of replies to, and forwards of, each one specifically. In the event that this number is equal, the “pivot” is considered to extend over both items. Similarly, in the situation of a burst of communication, the system seeks the item that is at the root of the greatest number of communications in the burst. This might be a forwarded email, a URL to something on the internet, or a textblock 325 extracted from a document 505. In any of these events, the system looks for the first occurrence of the content in question, up until the first item that is clearly part of the burst.

Note that while pivotal items often lie near partitions 2401, this will not always be the case. The system creates partitions 2401 for purposes of general usability and navigability. It will do this whether or not there are pivotal items. Conversely, since partitions 2401 can be configured to have a minimum length, in theory a partition 2401 could contain more than one pivotal item.

Missing Item Detection & Reconstruction

Real-world data often offers only an incomplete record of what transpired. For example, a large number of communications might be found with “RE: change request” in the title, but most email client programs only insert the “RE:” into a header when the message is a response to a prior email message. The logical assumption is that this thread started with a message called “change request” but that the email message that initiated the thread cannot be located, for whatever reason. (Even if the “re:” were manually inserted by the user, it can be considered some indication that the item in question was not originally a singleton.) A quoted textblock 325 that does not match any other mail is also considered evidence of a deleted email message. A reply-to ID that no longer resolves to an item, or similarly a frequently referenced link to a document 505 that no longer resolves, is another type of item whose original presence may easily be inferred by the system. Workflow 2409, either ad hoc or formal, also provides compelling evidence that certain items must have existed at one point.

The above example represents perhaps the simplest case of how the system copes with missing or corrupt data. In order to account for the strong likelihood that items submitted to analysis by the system will contain omissions and/or unparseable data, a mechanism is required for resolving missing references or predicting the existence of documents 505 not otherwise located by the system. This is crucial not only in the assembly of items into discussions 305, but also in the capability to detect specific patterns in the deletions of data.
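Purely as an illustrative sketch of the simplest case described above (replies whose originating message cannot be located), the inference might look as follows. The function names are hypothetical, and matching on a normalized subject line alone is an assumption made for brevity; it stands in for the fuller evidence (reply-to IDs, quoted textblocks 325, workflow 2409) that the system would weigh.

```python
import re

# Only the "RE:" prefix named in the text is handled here.
_REPLY_PREFIX = re.compile(r"^\s*re\s*:\s*", re.IGNORECASE)


def normalize_subject(subject):
    """Strip any number of leading "RE:" prefixes and fold case."""
    previous = None
    while previous != subject:
        previous = subject
        subject = _REPLY_PREFIX.sub("", subject)
    return subject.strip().lower()


def infer_missing_roots(subjects):
    """Return normalized subjects that appear only as replies, suggesting
    that the message which started the thread is absent from the corpus."""
    roots = {normalize_subject(s) for s in subjects if not _REPLY_PREFIX.match(s)}
    replies = {normalize_subject(s) for s in subjects if _REPLY_PREFIX.match(s)}
    return sorted(replies - roots)
```

For instance, given the subjects "RE: change request", "Budget" and "RE: Budget", only "change request" would be reported as a thread whose root appears to be missing.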

For example, let us take the following scenario:

-   Emails: A to B, C, D; B to A; C to A; D to A; D to B, C, A
-   Documents: C writes Contract
-   Meeting requests: A, B, C, D discuss contract; A, B, C, D discuss contract revision; A, B, C, D discuss customer rejection of contract offer

Here, the request to discuss the contract revision seems to occur without the contract revisions themselves being located. However, a number of factors seem to indicate that a revision must have occurred: there was a meeting to discuss a contract, and subsequently a meeting to discuss revisions to it, which was preceded by an email exchange between parties who had been at both meetings. Using the previously described linguistic processing techniques, including specific kinds of ontologies, the system attempts to identify events 1525 as best it can. In the above example, the system would be aware of all named documents 505 in the corpus, and with the assistance of a document modification event 540 ontology that recognizes such stems as “revise” and “rewrite” as well as phrase templates, the fact that a document 505 revision that appears to have once existed is no longer present could be trapped.

A related function of the invention is to actually reconstruct a document that no longer exists or is no longer accessible (has become corrupted, is strongly encrypted, etc.) but which has left an evidentiary trail behind, in those cases in which this is possible to achieve. In one embodiment of the invention, two main categories of reconstruction are performed:

-   Email. If an email was either part of an email thread or contained copy/pasted text from another email (as identified in the textblock 325 matching process), the email can be largely, and in some cases totally, reconstructed simply by removing all content that appears at a depth of zero.
-   Regular documents 515. The best that can be hoped for in this case is that an email or check-in message involving the change(s) that corresponded to that document 515 version or document 515 still exists and is accessible. Specifically, by “correspond,” we mean any of the following, though it is not limited to this list:
    -   The document 515 was referenced by title, directly or indirectly, in an email.
    -   The subsequent step in a workflow 2409 instance notes or bears witness to the contents of the previous step.
    -   It was contained in an attachment to an email, or in a link, but is no longer accessible.
    -   It had been in a document repository, but is no longer accessible.

In such cases, the “reconstructed” attribute of the document 505 is set to 1. The reconstructed document 505 content becomes one of the following:

-   The partial or full email, up to and including any header information that still exists.
-   Rump information about a document that no longer exists in any version. This includes communications as described above, workflow 2409 information (including the template that would have been used in the missing document, when this information is available), and any check-in messages, fax or other cover sheets that might have appeared in the same OCR'ed staple set.
-   The immediately previous version of the regular document 515, plus any rump information.

In the event that the document 505 is located later, it will replace the reconstructed version, and the reconstructed attribute will revert to 0.

Querying Engine

In one embodiment of the invention, the query engine 3120 accepts query parameters from a number of different sources, including but not limited to:

-   Query language 3102 (see ‘A Method and Apparatus to Visually Present Discussions for Data Mining Purposes’)
-   Natural language interface 3104
-   QBE GUI 3110 (see ‘A Method and Apparatus to Visually Present Discussions for Data Mining Purposes’)
-   Query Building Wizard 3144 (see ‘A Method and Apparatus to Visually Present Discussions for Data Mining Purposes’)
-   Direct Multi-Evidence Query GUI 3106 (see ‘A Method and Apparatus to Visually Present Discussions for Data Mining Purposes’)
-   Canned Query Generator GUI 3112 (see ‘A Method and Apparatus to Visually Present Discussions for Data Mining Purposes’)
-   Visual Query 3108 Specification (see ‘A Method and Apparatus to Visually Present Discussions for Data Mining Purposes’)
-   Access API 3114

FIG. 32 is a block diagram of the primary template and return types. The query engine 3120 understands the following classes of query 320 and related objects: actor query 3205, content query 3210, event query 3215, quantification query 3220, statistical analysis 3225 and anomaly or pattern 3230. The query engine 3120 generates the following return types: actors 310 and actor count 3235; discussions 305 and singletons 435, and count 3240; event 1525 count and instances 3245; object count 3250; output of statistical test 3255; anomaly records and/or anomalous discussions 305, and count 3260.

Regardless of the type of the query submitter, the behavior of the query engine 3120 is the same (see Blocks 3120-3142). If it is not explicitly passed a template (see ‘A Method and Apparatus to Visually Present Discussions for Data Mining Purposes’) to match the query 320 to, it will match the query 320 to the available templates 3118 based on the order and types of the query 320 terms. Note that the natural language interface 3104 includes additional terms that indicate the template 3118. For example, the presence of the words “how often” at the start of a query 320 indicates that it is an event-count query 320.

While most query templates 3118 result in the retrieval of discussions 305, this is not universally true. The query language 3102 allows for querying of anomalies, patterns, and trends. It also allows for statistical computations, such as asking the system to correlate two kinds of events. Further, transitive queries are also permitted. For example, instead of querying to see all of the discussions 305 on topic 315 X, the user may query directly to determine the list of actors 310 involved in those discussions 305, or who played particular roles in them (e.g., primary actor 2406, contributor 2404, etc.). Similarly, a user can query to see all actors 310 who generated content on topic 315 X, without explicitly involving discussions 305.

Query 320 result types include, but are not limited to, the following:

-   Discussions 305
-   Actors 310
-   Events
-   Documents 505
-   Anomaly instances
-   Number of entities (discussions 305, actors 310, events, documents 505, etc.)
-   Statistical test result (including but not limited to: statistical significance, correlation, conditional probability, variance, etc.)

Not every query 320 specified will be a valid one. In the event that no clear template 3118 match can be found, the system will abort the query 320 attempt, and prompt the user with structured assistance (such as a wizard in a GUI) to re-specify the query 320. If a particular term specified, such as an actor 310 name, cannot be found in the corpus, the system will present a list of nearest matches, and request that the user select the desired one. In one embodiment, this is done on the basis of n-gram analysis, while in another embodiment the system relies on phoneme-oriented technology such as that offered by Soundex. If this fails as well, the system will prompt the user for the organization or group of the desired actor 310, and will provide a list of names to browse through accordingly. Other terms are handled similarly. The system will proceed with the query 320 if it has at least one valid instance of each query 320 term required for the particular template 3118. If there are also invalid terms present, the system will generate a list of the unrecognized terms with a warning.
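As a non-limiting sketch of the n-gram embodiment of nearest-match lookup, candidate names might be ranked by the overlap of their character trigrams with the unrecognized term. The function names, the choice of trigrams, and the use of the Dice coefficient as the similarity measure are all assumptions made for illustration; the specification does not prescribe them.

```python
def char_ngrams(text, n=3):
    """Character n-grams of the lower-cased text, padded so that the first
    and last characters also contribute n-grams."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def nearest_matches(term, known_names, limit=3, n=3):
    """Rank known names by Dice similarity of their n-gram sets to the term,
    dropping names that share no n-gram with it."""
    query = char_ngrams(term, n)

    def dice(name):
        grams = char_ngrams(name, n)
        return 2 * len(query & grams) / (len(query) + len(grams))

    ranked = sorted(known_names, key=lambda name: (-dice(name), name))
    return [name for name in ranked[:limit] if dice(name) > 0]
```

A misspelled term such as "Jay Smth" would then surface "Jay Smith" ahead of less similar names, and the user would be asked to select the intended one.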

Queries that will result in either the retrieval or counting of discussions 305 perform a structured query on the set of discussion 305 records. Other types of multiple evidence queries which are not discussion-bound may have to access multiple record types. Hence those queries directly involving discussions 305 will be the most computationally efficient.

All queries produce an audit trail. In one embodiment of the system, this audit function tallies the number of queries on each entity (for example, how many times the actor 310 Jay Smith was specified in a user query 320), as well as the number of times each object was retrieved in response to a user query 320.

Relevancy Ranking Schemes

Because discussion 305 objects have so many different attributes, many based on entirely different kinds of evidence, a simple or fixed relevancy ranking scheme would be ineffective, and would lead to user confusion. Most queries entered into the system are likely to have at least 3 or 4 distinct dimensions, for example: actor 310, content description, action (i.e., created, modified, received, deleted, etc.) and time frame. In addition, in certain use cases, the most relevant characteristic of a discussion 305 might be arbitrary compound attributes, such as two complex events (i.e., actor 310 performing action on content) occurring in a particular sequence.

In order to counter these difficulties, the query engine 3120 performs a clustering analysis on the candidate result set (Block 3310). Results are returned according to cluster, with the members of the highest ranking cluster being displayed first. The clusters are ranked according to how well the canonical member matches the values of the most important evidence types specified in the query 320 (Blocks 3315, 3320). Which evidence types are most important is determined by the user. This may be done as a global default, or as an advanced option to an individual query 320. In one embodiment of the invention, this is done using a combination of ordinal ranking and a 5 point anchored importance scale ranging as follows:

-   5=critically important
-   4=important
-   3=moderately or somewhat important
-   2=relatively unimportant
-   1=unimportant

In one embodiment of the invention, a default setting is provided in which actor 310 specification has an ordinal ranking of 1 and an importance of 5, and content specification has an ordinal ranking of 2 and an importance of 5. Note that both an ordinal ranking and an importance scale are required; the former is necessary in order to break ranking ties, and the latter is required in order to know how much relative weight to assign to each evidence type. Specifically, the relevance ranks for individual items are determined using the importance scale value, multiplying it by the match coefficient (Blocks 3325, 3330) and then, where necessary, using the ordinal ranking to break ties (Block 3340). For example, if a user specifies in a query that it is critically important that a particular document type appear in a discussion 305, actor 310 information will still take precedence in ranking, presuming that actor 310 information has the highest ordinal ranking. However, if the user specifies that one actor 310 appearing in the results is more important than another particular actor 310 appearing, both will have the same ordinal rank, but will have different importance levels assigned to them.
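The weighting just described might be sketched, purely for illustration, as follows. The function name and data layout are hypothetical. Summing the per-evidence products into a single score, and breaking ties by comparing match coefficients in ordinal order, are assumptions of this sketch; the specification states only that importance is multiplied by the match coefficient and that ordinal ranking breaks ties.

```python
def rank_results(results, weights):
    """Order result ids from most to least relevant.

    results: {result_id: {evidence_type: match_coefficient in [0, 1]}}
    weights: {evidence_type: (ordinal_rank, importance on the 1-5 scale)}
    """
    by_ordinal = sorted(weights, key=lambda e: weights[e][0])

    def sort_key(result_id):
        coeffs = results[result_id]
        # Assumed aggregation: sum of importance * match coefficient.
        score = sum(weights[e][1] * coeffs.get(e, 0.0) for e in weights)
        # Assumed tie-break: the better match on the higher-ranked evidence wins.
        tie_break = tuple(-coeffs.get(e, 0.0) for e in by_ordinal)
        return (-score, tie_break)

    return sorted(results, key=sort_key)
```

With the default setting above (actor: ordinal 1, importance 5; content: ordinal 2, importance 5), two discussions with equal total scores are ordered by how well each matches on actor.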

The match coefficient is used to express the degree to which the attributes of an individual discussion 305 match what the user has specified in a query 320. For example, in a query 320 the user may have specified an actor 310 group which comprises 8 members. A particular discussion 305 might only be associated with 5 of those actors 310, but might strongly match other terms in the query 320. Most embodiments of the invention take the viewpoint that in this example, the discussion 305 should not receive a score of 0 in the actor 310 dimension. In such embodiments, the discussion 305 will therefore receive a partial score for actors 310. In one such embodiment, the discussion 305 would receive a score of 5/8. In other embodiments, the partial score is based on the percentage of overlap between the actors 310 occurring in the discussion 305, and those specified in the query 320. So if the discussion 305 in question had 10 other actors 310 associated with it apart from those 5 named in the query 320, the partial score would be 5/18 (5 shared actors 310 out of the 18 distinct actors 310 in the two sets combined). Still other embodiments perform similar calculations, but factor in the different possible levels of actor 310 participation. Another embodiment requires that at least one of the primary actor(s) 2406 in the discussion 305 appear in the query 320 in order for a partial score to be assigned. Other embodiments extend this to contributors 2404 as well. However, the user may specify in the query 320 that the presence of a certain actor 310 is required in order to return a result. In such instances, this specification takes precedence over any partial scoring mechanism.
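The two partial-score variants in the example above can be reproduced with a short sketch. The function names and the `required` parameter are hypothetical, and the query is assumed to name at least one actor.

```python
from fractions import Fraction


def coverage_score(query_actors, discussion_actors, required=()):
    """Fraction of the queried actors that appear in the discussion."""
    q, d = set(query_actors), set(discussion_actors)
    if not set(required) <= d:
        return Fraction(0)  # a required actor is absent: no partial credit
    return Fraction(len(q & d), len(q))


def overlap_score(query_actors, discussion_actors, required=()):
    """Shared actors divided by all distinct actors in either set."""
    q, d = set(query_actors), set(discussion_actors)
    if not set(required) <= d:
        return Fraction(0)
    return Fraction(len(q & d), len(q | d))
```

For a query naming 8 actors and a discussion containing 5 of them plus 10 others, `coverage_score` yields 5/8 and `overlap_score` yields 5/18, matching the figures given in the text.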

Other examples of where match coefficients should be used include, but are not limited to:

Ontology classes: In one embodiment, a parent ontology class may be substituted for a child, and similarly, one sibling class may be exchanged for another. In one such embodiment, either substitution results in a partial score of 0.75.

Document Templates: Similarly, a document 505 in a discussion 305 may not have been created with the exact template specified in the query 320, but rather with a specialization, generalization, or sibling of it. In one such embodiment, any of these substitutions results in a partial score of 0.75.

Events: Similar logic applies to events 1525. This is especially important, since some events 1525 extracted from sources such as online calendars may frequently be underspecified. Further, there may be arbitrarily deep hierarchies of user-created event 1525 types, decreasing the likelihood of exact type matches.

Within each cluster, by default, the results are shown in order of highest score relative to the query 320. Singletons 435 can also appear in result sets, but by default will do so only after the list of discussions 305; in some embodiments the singletons 435 (when present) are placed in a separate result set, and displayed separately. Singletons 435 are relevance ranked using standard lexical relevancy ranking techniques.

In one embodiment, the user may specify a query 320 that would result in a specific set of discussions 305 being returned, and in this same query 320 specify that he wishes to see any anomalies related to this set of discussions 305. This will result in any discussions 305 containing anomalies being brought to the top of the results list, regardless of other scoring. The relevant anomaly records will also be returned.

Some embodiments of the invention may also relevancy rank according to factors including, but not limited to:

-   The length of the discussion 305 in terms of item count.
-   The maximum depth of the discussion 305.
-   Whether it achieved resolution; incomplete discussions 305 are valued less highly.
-   The number of times a document 505 or communication within the discussion 305 has been forwarded to other actors 310, forwarding often being an indicator of relevance.
-   In e-mail reply chains, the value of the X-importance flag.
-   The number of actors 310 to which a communication was sent, and the identity and importance or circle of trust 1505 or organizational membership of those actors 310.

Querying by Example

The invention allows various forms of Query by Example. It allows the user to submit a discussion 305 as an exemplar, using the scoring mechanisms discussed above. It also allows one or more singletons 435 to be so submitted. In one embodiment, all exogenous documents 505 of the same type are processed in a batch so that similarities and overlap may be noted (Block 3512). These documents 505 can be part of the corpus, or any arbitrary document outside of the corpus. One highly useful use case of this is loading depositions into the system. Such especially useful document types may have special interpretative templates created for them (Blocks 3502, 3504), as well as special results output behavior. For example, in the case of a deposition, the structure consists of a sequence of question blocks followed by answer blocks. The identity of the deponent can easily be identified from the document format, instead of having to rely on the first known actor reference located in the document (Block 3508). This information can be leveraged to parse the question, and to submit the block comprising the question and answer, with the deponent actor substituted for the pronoun in the question (Block 3510), to the QBE processor (Block 3430). It can also be used to group results very effectively, for example grouping together and comparing the responses of multiple deponents to the same question. More generally, if there is no template available for the document in question, pragmatic tagging will be used to separate questions from answers (Block 3506), and the system passes the document 505 through its ontological filters in order to ascertain references to specific topics 315 of interest in the corpus (Block 3506). The QBE processor performs a full linguistic analysis (Block 3514) and extracts any named entities, such as actors 310 or document titles (Block 3506). Then it extracts dates, as well as geographic locations (Block 3435). In a segment-by-segment analysis, where the segments are first sentences and then individual textblocks 325, the actor 310, action, content descriptor, time frame, and potentially other data, such as event information, are extracted and translated into queries in any case in which, at a minimum, an actor 310, an action, and a content descriptor have been identified.

Depositions and similar question and answer formatted documents 505 are of special interest for commercial reasons. In particular, in those instances where a large number of deponents are asked the same (or very similar once post-processed) questions, the invention can be used to compare and contrast the different answers provided (Blocks 3520, 3518).

-   Perform an n-gram analysis of the different answers to the same question.
    -   Within each set of answers to a given question, group similar answers based on shared n-grams and shared named entities (Block 3522). This is an unsupervised text clustering within one supervised cluster.
    -   Using pragmatic tags 2301 to separate negative from positive responses, further categorize the responses (see Block 3524). Count up and correlate the actors 310 providing similar responses with the circle of trust 1505 information computed from the corpus (Block 3524).
    -   Aggregate the clusters of actors 310 over all answer sets. Again look for intersections with circles of trust 1505 computed from the corpus (Block 3526).
    -   If there is a pattern of unusually common agreement among actors 310 who are members of the same circle of trust 1505 as opposed to arbitrary other deponents, the system generates a message containing the names of the actors 310, the question, and the most canonical example of a response. (For the preceding bullet point see Blocks 3528, 3530, 3532, 3534.)

Statistics and Anomalous Behavior

In order to help build discussions 305, the system calculates the patterns of actor 310 behavior. However, the flip side of calculating the patterns is determining the anomalies, or exceptions to these generally practiced patterns of behavior. Such anomalies are an important facet of the communication of the groups of actors 310 in the corpus. For example, sometimes the fact that a particular actor 310 was manually eliminated from a “reply all” is more interesting than the contents of the email in question. To this end, the system calculates these anomalies, and allows them to be queried by the user. Anomalies are properties of the actors 310 committing them; however, a discussion 305 may be flagged as anomalous if it contains anomalous behavior. In one embodiment of the invention, anomalies are determined as follows.

The system calculates the proportion of messages any individual actor 1210 addresses to any other actor 310 or group of actors 310 over discrete time intervals. Any change in these numbers or proportions may indicate a significant event such as a change of role for the actor 310 in question, the start or end of a project, or the start or end of some activity that the system is attempting to detect. Note that in those instances where the detected anomaly matches on actor 310 and time with a well-known calendar event 2108, such as the end of a project, no anomaly will be reported. Similarly, the system detects other changes in behavior, or anomalous behavior, including, but not limited to:

-   An actor 310 starting to cc or bcc himself on certain e-mail messages.
-   An actor 310 starting to forward messages or documents 505 to another of his own accounts, or to another actor 310, either of which may be inside or outside the organization.
-   Situations where Actor A organizationally reports to Actor B, and Actor B reports to Actor C, but A and C communicate with each other frequently, excluding Actor B. Similarly, any instances in which a circle of trust 1505 exists that does not conform to the organizational chart (in which links exist in the circle of trust 1505 that do not exist in the graph representing the organization chart), or any other instance in which the flow of communication is not aligned with the org chart, either skipping over actors 310, or violating organizational boundaries. This includes, but is not limited to, cases such as the one in which Actors A-E all report to Actor Z, but A only regularly communicates with B, C, and D. This information may optionally be correlated to other actor 310 information that may be available, such as age, race, gender, and religious or sexual orientation. Note that in addition to organization charts, other information such as project membership lists may be fed into the system in order to avoid false positives caused by actors 310 communicating extensively within a project spanning multiple organizations. Note that the system constructs different versions of an organization chart graph by extracting the actor lifecycle events 545 from an HR system, and then building as many versions of the resulting graph as necessary. (No more than one new version per calendar day can be generated. This restriction is to account for bulk events, such as the transfer of 20 persons from one organization to another.)
-   Situations where the actors 310 discussing a particular subject are not those who would normally be expected to discuss that subject. This is detected by finding isolated islands of topic 315 communication, as determined by the use of ontology classes or statistical topic analysis. For example, in the accounting department, there will typically be many finance-related discussions 305. The saturation level of these topics 315 in the discussions 305 in which the actors 310 in this department participate can therefore be expected to be high. However, in the engineering department, for example, discussing such topics 315 is unusual, again as measured by the saturation level of the topic 315 in the discussions 305 in which actors 310 in that department are engaged.
-   Chains of trust: patterns where whenever Actor X communicates with Actor Y, Actor Y always communicates with Actor Z immediately afterwards (or in the first bunch of e-mails thereafter, to allow for Actor Y returning from an absence, so-called “warped” time). Likewise, situations where this pattern initially exists, but then suddenly ceases to exist. This is determined by walking the colored graph for common patterns of sequential actor 310 communications. Those not corresponding to a known workflow process are flagged; the subset of these in which an organizational boundary of some kind (even a hierarchical or lateral one) has been crossed are flagged as anomalous. Note that in this regard, the organizational boundary is determined by assessing the primary work domain of each actor 310. In other words, in this particular calculation, an actor 310 making use of another actor's 310 home email account will not be considered to be crossing an organizational boundary if the two actors 310 are in the same organization.
-   Instances in which an actor 310 who does not normally edit a specific kind of document 505 (determined by either topic 315, file extension, template, or a combination of the above), or a specific instance of a document 505, does so. Further, instances in which such an edit appears to have been requested by another actor 310, as identified by a combination of pragmatic tagging and named entity extraction (i.e., the document 505 is referenced by name in a communication), or by clear reference to the document 505 as an attachment to the present or a related communication.
-   Changes in “warped time” for more than 50% of the actors 310 who are members of the same circle of trust 1505, aggregate actor 1205, or organizational group. Put otherwise, a statistically significant increase in the amount of after-hours communication and other electronic activity.
-   Similarly, time intervals providing evidence of the bursty behavior used to establish event-based circles of trust 1520 are flagged as anomalies.

In addition, the system keeps a count of each such instance per actor ID, as well as the impacted items. These items can be mapped to their containing discussions, thus allowing discussions containing anomalous items to be identified. The system can thus also identify any actor 310 who performs these unusual activities a significant amount on a long-term ongoing basis. Such an actor 310 would avoid detection based on examination of only his own behavior. However, when contrasted with the rest of the population, the behavior will appear appropriately anomalous.

The system will endeavor to correlate all anomalies that occur more than once to any periodically or frequently occurring event 1525. The time threshold used may be set by the user; by default, if L is the temporal length of the event, an anomaly would have to occur within |L| of the event 1525 in order to be considered related.
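One possible reading of the default threshold is sketched below for illustration. The function name and the tuple layout of events are hypothetical, and interpreting “within |L| of the event” as falling between L before the start and L after the end of the event is an assumption of this sketch, not a statement of the specification.

```python
def related_events(anomaly_time, events, threshold=None):
    """Return names of events to which an anomaly may be related.

    events: iterable of (name, start, end) in any consistent time unit.
    threshold: user-set window; defaults to each event's own length L.
    """
    related = []
    for name, start, end in events:
        window = (end - start) if threshold is None else threshold
        # Assumed reading: the anomaly lies in [start - window, end + window].
        if start - window <= anomaly_time <= end + window:
            related.append(name)
    return related
```

Under this reading, an event lasting 10 time units admits anomalies up to 10 units before it starts or after it ends, unless the user supplies a different threshold.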

As noted in a previous section, in general, abrupt or gradual changes in the interactions among specific actors 310 are not considered anomalous per se. However, pairwise histogram information of actor 310 communication is calculated during the computation of circles of trust 1505, and such changes are queryable through the querying engine.

Forensic Vs. Continuous Systems

The invention herein described can be used in either a forensic context, or on a “live,” continuous, or “incremental” monitoring basis. The main difference is that in the case of the latter, the interpretation of some items may have to be deferred until there is sufficient data to determine how to proceed. From a procedural standpoint, this amounts to the same thing as adding new data sets in the forensic use case. However, in the continuous case, discussions 305 which do not yet have resolutions are marked with a special “pending” flag until the discussion 305 is considered to have concluded based on a long period of inactivity (see previous discussion of this). In one embodiment, singleton items 435 that were created by actors 310 indigenous to the corpus are similarly marked as pending for a certain period of time based on the mean time to the next expected related event, if there is to be one. For example, if the item is an email from Actor A to Actor B, the system will use the communication profile history of these actors 310 to determine how long to wait before no longer expecting a response. For other types of items, the mean time to the next related event is calculated on the basis of past historical mean times between events of this type in a discussion 305 in which the same actor 310 was involved, and the subsequent event. In one embodiment, after an interval of 3*t has passed, where t is the mean time to the next related event occurring, the pending flag will be removed. Other embodiments may use a different constant.
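The pending-flag rule of this embodiment can be illustrated with a minimal sketch. The function name and parameter names are hypothetical, and treating the flag as removed once exactly 3*t has elapsed is an assumption about the boundary case.

```python
from statistics import mean


def is_pending(elapsed, past_intervals, multiplier=3.0):
    """Keep an item pending until multiplier * t has elapsed, where t is the
    historical mean time to the next related event for comparable items."""
    t = mean(past_intervals)
    return elapsed < multiplier * t
```

For example, if past responses between two actors arrived after 2, 4 and 6 hours (t = 4), an unanswered email would remain pending for 12 hours. Other embodiments would pass a different `multiplier`.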

Note that even discussions 305 that were not labeled as pending may yet change, as “new” retroactive data becomes available. Examples of cases in which this might occur include an actor 310 synching a mobile device with a networked computer after several days of use, an actor 310 who had been on vacation without internet access suddenly getting back online and sending numerous communications, or a human resources database being updated after numerous changes had already taken effect. Therefore, the pending flag should be thought of as a probabilistic guess on the part of the system based on historical behavior, rather than as an absolute guarantee of stability of non-pending items. If a non-pending discussion 305 is changed, its version number is increased by 1. Pending items do not have version numbers. Discussions 305 which are considered to be complete by the system may be data warehoused or moved into other types of systems for examination.

An appropriate processing interval is determined by the system's administrator. If desired, different processing intervals may be set for data of different types, or in different locations. The most appropriate interval length is determined by a number of different factors, including how frequently new data arrives, and the particular application of the system. For example, in one embodiment, as the data is processed, an alert is sent to a pre-configured list of users if certain ontology classes trap data (for example, child pornography) or for different kinds of anomalies, such as information leakage. A log of all such occurrences is also maintained by the system. In such cases, a frequent processing cycle is required.

FIG. 36 a is a block diagram of the dynamic update detection process. The system uses polling or subscriptions to monitor file system changes (Block 3604), application store changes (Block 3608) and changes from other sources including the network 105 (Block 3606). Files 3602 are passed as input to the spider, adding file location and timestamp to the spider database, if not already there (Block 3610). Duplicates are detected (Block 3612), files are split into communications (Block 3614), a communications fingerprint is computed (Block 3616), and the fingerprint is associated with a spider database entry (Block 3618). If the fingerprint has not been seen before (Block 3620), indexing 3622 (possibly incremental) is required (Block 3622).
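The fingerprinting steps of FIG. 36 a (Blocks 3610 through 3620) can be sketched roughly as below. The SHA-256 hash over whitespace-normalized, lower-cased text, the in-memory dictionary standing in for the spider database, and all names are assumptions for illustration; the description does not specify how the fingerprint is computed. Splitting a file into communications (Block 3614) is assumed to have been done by the caller.

```python
import hashlib


def fingerprint(communication: str) -> str:
    """Compute a fingerprint for one communication (assumed: SHA-256 of normalized text)."""
    normalized = " ".join(communication.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SpiderDatabase:
    """Hypothetical in-memory stand-in for the spider database."""

    def __init__(self):
        self.entries = {}   # file location -> {"timestamp": ..., "fingerprints": [...]}
        self.seen = set()   # every fingerprint encountered so far

    def process_file(self, location, timestamp, communications):
        """Record the file and return the communications that still need indexing."""
        # Add file location and timestamp if not already there (Block 3610).
        entry = self.entries.setdefault(
            location, {"timestamp": timestamp, "fingerprints": []})
        to_index = []
        for comm in communications:
            fp = fingerprint(comm)                      # Block 3616
            if fp not in entry["fingerprints"]:
                entry["fingerprints"].append(fp)        # Block 3618
            if fp not in self.seen:                     # Block 3620
                self.seen.add(fp)
                to_index.append(comm)                   # indexing required
        return to_index
```

Under this sketch, a communication that reappears in a second file with different spacing or capitalization yields the same fingerprint and is not queued for indexing again.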

FIG. 36 b is a block diagram of the indexing process 3622. Full-text indexing 3623 of free text and metadata (Block 3624) allows a graph index to be generated (Block 3626) and discussions 305 to be built (Block 3628). Constrained clustering is conducted (Block 3634), rules are applied (Block 3636), local search is carried out (Block 3638) and actor 310 aliases are deduplicated (Block 3640).

Privacy Issues

Access Control Lists (ACLs) may be set up on the basis of actor 310 (individual or aggregate), topic 315, the presence of specific document types (as determined by pragmatic tags, the presence of specific words or phrases, or file extension or name), or the intersection or union of any of these attributes. In one embodiment, the ACLs used are those of the underlying operating system, so that the present invention inherits the security model of the computing environment in which it is running, and allows the reviewer 2407 access only to those items to which the reviewer 2407 would normally have access using access methods other than the present invention. The system may be configured to not show any discussion 305 in which even one protected item appears. If this option is not enabled, the discussion 305 items will be automatically renumbered by the system prior to display to the user so as to mask the existence of the protected item. Note that the numbers are mapped back to the correct underlying numbers if the user makes a modification, such as an annotation. An option also exists to exclude from view any personal communications. Personal communications are identified as correspondence with, or documents 505 from, an actor 310 outside the primary domains of the corpus who is manifested in the corpus only in relation to a single actor 310. Similarly, items that would be considered as “privileged” in a court of law may be automatically excluded from view for everyone except the superuser, who has access to all data.
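The renumbering and mapping-back behavior described above can be sketched as follows. The function name, the representation of items as (number, item) pairs, and the can_view callback standing in for the ACL check are all assumptions made for illustration.

```python
def renumber_visible(items, can_view):
    """Renumber the viewable items of a discussion to hide protected ones.

    items: list of (original_number, item) pairs in discussion order.
    can_view: callable returning True if the reviewer may see the item
              (a hypothetical stand-in for the ACL check).
    Returns the list to display and a map from displayed to original numbers.
    """
    display = []
    to_original = {}
    for original_number, item in items:
        if not can_view(item):
            continue  # protected item: omitted, and leaves no gap in the numbering
        shown_number = len(display) + 1
        display.append((shown_number, item))
        to_original[shown_number] = original_number
    return display, to_original
```

When the user annotates a displayed item, the to_original map would be consulted so that the annotation attaches to the correct underlying item number.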

Conclusion

An apparatus has been described for processing a corpus of electronic data associated with a finite (though potentially very large) set of actors into sets of causally related documents or records, which are referred to here as ‘discussions’ 305. This contextualization of the data makes for much easier, faster, and less ambiguous reading of a related series of communications, events, or documents. Moreover, it greatly reduces the number of false positives returned to the user who submits queries, thereby virtually eliminating the problem (commonly encountered with keyword search) of huge results lists containing just a few items of interest. It additionally facilitates the identification of patterns of behavior among the actors associated with the corpus. This can be used both to carry out informal process assessment, and to detect anomalies in these patterns that may be important in an investigative or litigation context.

By managing documents within their sociological context, the present invention allows the user to understand the who, when, where, how, and why of the document, and the document's relationship to other documents. Specifically:

Who: The system's notion of “who” includes, but is not limited to the following: Is the document an attachment, or inline text in an email message; Who is the author of the document; If the document is an attachment, is its author different from the author of the email; Was it forwarded from another source; Who was it sent to; Was it sent to individuals, or to lists; Who was it cc'ed to; Bcc'ed to; Was there a “reply to:” different from the authoring account; Does the exchange fan out over time to include more people, or does it contract; Who has deleted the message; Saved the attachment; Did the sender delete it, especially from his “sent” box; Did he specifically cc: or bcc: it to himself (indicating that he felt it had a certain importance when he sent it); Who has retrieved or examined the discussion.

When: The system's notion of “when” includes, but is not limited to the following: When was the document first sent; What time of day; Is the time of day consistent with other messages classified as being in the discussion; What day of the week; Was it sent more than once; By the same person, or also by other people, and if so, by whom; If an attachment, when did it first appear in an email; Has it been updated or revised; If so, how many times.

Where: The “where” is a means of capturing personal context. Personal context is extremely helpful in assisting someone to remember the range of dates or approximate time span in which the discussion may have initiated, intensified, or concluded. For example, someone might remember “The conversation really heated up when I was in France.” (Further, if the trip information is entered into a calendar, the system can retrieve the date range automatically.)

“Where” includes the answers to questions such as: From what account was the document sent, a corporate or a personal one; From what time zone (was the sender traveling at the time); From what kind of device. In the context of the “where” part of the equation, the type of device is relevant in that it is a hint about the user's physical location at the time he sent the message. For example, if he is at his desk, he is unlikely to use a mobile device.

How: The system's notion of “how” includes, but is not limited to the following: Was the information sent via email; With what priority; Via an instant message; Via a text page; From what kind of device; Via an attachment in an email; If an attachment, was text from the attached document copied inline in the email itself; Does the attachment appear to be summarized in the email.

The type of device, or method of sending the message, may alter certain aspects of the communication. For example, an email message sent from a small mobile device is likely to be shorter than one sent from a desktop machine, and may also be likelier to reference a previous or forthcoming message. A response from a mobile device, especially a longer one, may suggest time criticality of the discussion.

Why: The system's notion of “why” includes, but is not limited to the following: in response to a previous message; as the result of a known meeting (one-time, frequent, or periodic), which might be recorded in an Outlook or similar online calendar; as part of a periodic series of messages, such as a weekly status report; to send out an updated version of an attachment; as part of a repeatable or known business process.

When taken collectively, the above-described data allows the document to be placed in its real world context and subsequently classified appropriately on this basis. The result is what is referred to here as a discussion 305.

In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

1. A processing system for creating metadata to identify interrelated documents, comprising: a metadata repository for storing a plurality of metadata elements to represent relations between a plurality of documents; a sociological analysis engine to identify relationships between the documents using the metadata elements from the metadata repository.
2. The processing system recited in claim 1, wherein the metadata includes an actor.
3. The processing system of claim 2, wherein the sociological analysis engine automatically resolves references to an actor who has more than one electronic identity.
4. The processing system recited in claim 2, wherein the sociological analysis engine automatically resolves references to an actor who has more than one electronic personality.
5. The processing system recited in claim 1, wherein the sociological analysis engine detects missing data by reference to other documents within the corpus.
6. The processing system recited in claim 1, wherein the sociological analysis engine removes or otherwise makes unavailable documents or parts thereof deemed by the engine to be of a particular content type.
7. The processing system recited in claim 1, wherein the sociological analysis engine categorizes a plurality of documents by iteratively attempting to match the documents to multiple ontology classes, both individually and in combination.
8. The processing system recited in claim 1, wherein the sociological analysis engine determines the most likely correct term in a document that has been input using a process that is subject to error.
9-10. (canceled)
11. A method of enabling document production comprising: a document repository for storing a plurality of documents; a metadata repository for storing a plurality of metadata elements to represent relations between the plurality of documents; a sociological analysis engine to identify relationships between the documents using the metadata elements from the metadata repository; and a redaction tool to remove or otherwise make unavailable documents or parts thereof deemed by the engine to be of a particular content type.
12. The method of claim 11, wherein the content type is attorney-client protected documents.
13. The method of claim 11, wherein the redaction tool creates a copy of any document having partial redactions, and ensures that the redacted data cannot be accessed from the document copy.